KRT Checker

Overview

The KRT Checker assesses Key Resources Tables for compliance with ASAP (Aligning Science Across Parkinson's) standards. Validation runs in automated stages: data structure analysis, column mapping, and content validation. Together these stages identify deviations from the required format specifications.

File Processing and Data Ingestion

Validation begins with file format detection and encoding analysis. CSV and Excel (.xlsx) formats are supported. Encodings are tried in the following order: UTF-8, Latin-1, CP1252, ISO-8859-1 (Latin-1 and ISO-8859-1 are two names for the same encoding). For files containing mixed format structures, header detection inspects the first two rows to find the row where the data starts. Rows containing only null values or whitespace are removed during preprocessing.
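The encoding fallback and empty-row removal described above can be sketched as follows. This is a minimal illustration for CSV input only, not the checker's actual code: the function names decode_with_fallback and read_rows are hypothetical, and Excel handling and header detection are left out.

```python
from __future__ import annotations

import csv
import io

# Candidate encodings in the order given in the text above.
ENCODINGS = ["utf-8", "latin-1", "cp1252", "iso-8859-1"]


def decode_with_fallback(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes cleanly.

    Hypothetical helper. In Python, 'latin-1' maps every byte to a
    character and never raises, so with this order the loop does not
    get past the second entry.
    """
    for enc in ENCODINGS:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode file with any candidate encoding")


def read_rows(raw: bytes) -> list[list[str]]:
    """Parse CSV bytes and drop rows whose cells are all empty or whitespace."""
    text, _ = decode_with_fallback(raw)
    reader = csv.reader(io.StringIO(text, newline=""))
    return [row for row in reader if any(cell.strip() for cell in row)]
```

Because the Latin-1 step accepts any byte sequence, a tool that needs CP1252 to be tried would have to place it before Latin-1; whether the checker does so is not stated here.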

Column Mapping

Column recognition uses case-insensitive string matching to map user-defined column names to the six required ASAP columns: RESOURCE TYPE, RESOURCE NAME, SOURCE, IDENTIFIER, NEW/REUSE, ADDITIONAL INFORMATION. The mapping algorithm handles common variations including underscore substitutions, hyphen variations, and abbreviated forms. Columns beyond the required set are identified and reported.
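One way to implement this kind of mapping is to normalize each header before looking it up. The sketch below assumes that approach. The ALIASES table and the names normalize and map_columns are illustrative assumptions, since the checker's real list of accepted abbreviations is not given here.

```python
import re

REQUIRED_COLUMNS = [
    "RESOURCE TYPE", "RESOURCE NAME", "SOURCE",
    "IDENTIFIER", "NEW/REUSE", "ADDITIONAL INFORMATION",
]

# Hypothetical abbreviations, for illustration only.
ALIASES = {
    "TYPE": "RESOURCE TYPE",
    "NAME": "RESOURCE NAME",
    "ID": "IDENTIFIER",
    "ADDITIONAL INFO": "ADDITIONAL INFORMATION",
}


def normalize(name):
    """Upper-case the name and treat underscores, hyphens, slashes as spaces."""
    return re.sub(r"[\s_\-/]+", " ", name).strip().upper()


def map_columns(headers):
    """Return (mapping, extra, missing) for a list of user column names."""
    canonical = {normalize(c): c for c in REQUIRED_COLUMNS}
    canonical.update({normalize(k): v for k, v in ALIASES.items()})
    mapping, extra = {}, []
    for header in headers:
        target = canonical.get(normalize(header))
        # A second header that maps to an already-claimed column is
        # reported as extra rather than silently overwriting the first.
        if target and target not in mapping.values():
            mapping[header] = target
        else:
            extra.append(header)
    missing = [c for c in REQUIRED_COLUMNS if c not in mapping.values()]
    return mapping, extra, missing
```

For example, the headers "resource_type", "Resource-Name", and "new/reuse" all normalize to their required column names, while an unrecognized header such as "Notes" ends up in the extra list.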

Resource Type Validation

Resource type validation compares each entry against the ASAP-approved list of 14 standardized categories, requiring an exact match apart from letter case. Non-conforming entries are flagged, and likely intended categories are suggested using fuzzy string matching based on character similarity and keyword analysis. The algorithm accounts for pluralization variants, abbreviated forms, and commonly used alternative terminology.
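The match-then-suggest behavior can be approximated with difflib from the Python standard library. This is a sketch under stated assumptions: APPROVED_TYPES is a short illustrative subset rather than the official 14-category list, the 0.6 cutoff is arbitrary, and only character similarity is covered (the keyword analysis mentioned above is omitted).

```python
import difflib

# Illustrative subset only; NOT the official ASAP list of 14 categories.
APPROVED_TYPES = ["Dataset", "Software/code", "Antibody", "Protocol"]


def check_resource_type(value, approved=APPROVED_TYPES, cutoff=0.6):
    """Return (is_valid, candidates).

    A valid entry matches an approved type exactly, ignoring case and
    surrounding whitespace. Otherwise candidates holds up to three
    approved types ranked by character similarity, best first.
    """
    lookup = {t.casefold(): t for t in approved}
    key = value.strip().casefold()
    if key in lookup:
        return True, [lookup[key]]
    close = difflib.get_close_matches(key, list(lookup), n=3, cutoff=cutoff)
    return False, [lookup[c] for c in close]
```

With this subset, "dataset" is accepted as "Dataset", while the plural "Datasets" is rejected with "Dataset" as the top suggestion.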

Required Field Completeness Analysis

Completeness is enforced for four mandatory fields: RESOURCE TYPE, RESOURCE NAME, IDENTIFIER, and NEW/REUSE. The analysis recognizes several forms of missing data, including null values, "N/A" entries, whitespace-only cells, and common placeholder strings. For the IDENTIFIER field, when no formal identifier is available, the system accepts the specific string "No identifier exists" so that the completeness requirement is still met.
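A completeness check along these lines might look like the following sketch. The PLACEHOLDERS set is an assumption (only "N/A" is named above), as are the function names and the choice to compare the "No identifier exists" string exactly.

```python
REQUIRED_FIELDS = ["RESOURCE TYPE", "RESOURCE NAME", "IDENTIFIER", "NEW/REUSE"]

# Assumed placeholder strings, compared case-insensitively.
PLACEHOLDERS = {"", "n/a", "na", "none", "null", "tbd", "-"}

# The one accepted stand-in for a missing identifier.
NO_IDENTIFIER = "No identifier exists"


def is_missing(value, field):
    """True if value is null, whitespace-only, or a known placeholder."""
    if value is None:
        return True
    text = str(value).strip()
    # Checked explicitly so that extending PLACEHOLDERS later can never
    # cause the accepted stand-in to be rejected.
    if field == "IDENTIFIER" and text == NO_IDENTIFIER:
        return False
    return text.casefold() in PLACEHOLDERS


def find_missing(rows):
    """rows is a list of dicts; return (row_number, field) pairs, 1-based."""
    return [
        (number, field)
        for number, row in enumerate(rows, start=1)
        for field in REQUIRED_FIELDS
        if is_missing(row.get(field), field)
    ]
```

SOURCE and ADDITIONAL INFORMATION are not in REQUIRED_FIELDS, so empty cells in those columns are not reported.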

Data Availability Assessment

The system analyzes resource entries to identify new datasets and new software/code resources based on the combination of the RESOURCE TYPE and NEW/REUSE fields. When no new datasets are listed, a warning is generated that includes template language for a Data Availability Statement. Likewise, when no new software/code resources are listed, the warning includes template language for a Code Availability Statement, in accordance with ASAP reproducibility requirements.
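The check reduces to asking whether any row combines a given resource type with a "new" status. The sketch below assumes the type labels "Dataset" and "Software/code" and the status value "New"; the template constants are placeholders, not the actual ASAP wording.

```python
# Placeholders only; the real template language is not reproduced here.
DATA_TEMPLATE = "<Data Availability Statement template>"
CODE_TEMPLATE = "<Code Availability Statement template>"


def assess_availability(rows):
    """Return a list of (kind, template) warnings for a list of row dicts.

    Assumes RESOURCE TYPE values "Dataset" and "Software/code" and a
    NEW/REUSE value of "New", all compared case-insensitively.
    """
    def has_new(resource_type):
        return any(
            str(row.get("RESOURCE TYPE", "")).strip().casefold() == resource_type
            and str(row.get("NEW/REUSE", "")).strip().casefold() == "new"
            for row in rows
        )

    warnings = []
    if not has_new("dataset"):
        warnings.append(("data", DATA_TEMPLATE))
    if not has_new("software/code"):
        warnings.append(("code", CODE_TEMPLATE))
    return warnings
```

A table with a new dataset but only reused software would therefore produce a single warning, the one carrying the code availability template.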

Output and Reporting

Validation results are categorized into three types: errors (compliance failures), warnings (recommended improvements), and successes (conforming elements). Error messages include row-specific references and descriptions of the non-compliance issue. The validation process generates a complete log of all processing steps and decisions made during analysis.
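A small data structure is enough to hold the three result types together with a running log. The sketch below is one possible shape. The class names Finding and Report and the log line format are assumptions, not the checker's actual output format.

```python
from dataclasses import dataclass, field
from typing import Optional

LEVELS = ("error", "warning", "success")


@dataclass
class Finding:
    level: str                 # one of LEVELS
    message: str
    row: Optional[int] = None  # row reference, when the finding has one


@dataclass
class Report:
    findings: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def add(self, level, message, row=None):
        """Record a finding and append a matching line to the log."""
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.findings.append(Finding(level, message, row))
        prefix = f"row {row}: " if row is not None else ""
        self.log.append(f"[{level.upper()}] {prefix}{message}")

    def by_level(self, level):
        """Return the findings of one result type, in insertion order."""
        return [f for f in self.findings if f.level == level]
```

Keeping the log separate from the findings lets a caller display a grouped summary while still being able to replay every decision in order.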