KRT Converter

Overview

The converter transforms Key Resources Tables (KRTs) between the Cell Press and ASAP formats. Conversion in each direction runs in two stages. Structural differences between the formats are handled by parsing the input, mapping its categories, and reconstructing the output, while preserving the content of each resource entry.

Cell Press to ASAP Conversion

Stage 1 - Structure Parsing: The algorithm analyzes the hierarchical Cell Press format by identifying category headers through pattern detection. Category headers are recognized when the first column contains data while subsequent columns (source and identifier) remain empty. Resource entries are associated with the most recent category header encountered during sequential processing. The parser handles optional title rows and variable header positioning through structural analysis of the first two data rows.
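
The header-detection rule above can be sketched in plain Python. This is a minimal illustration, not the converter's actual code: the function name parse_cell_press, the dictionary keys, and the sample rows are hypothetical, and the input is assumed to be a list of three-cell rows rather than a pandas DataFrame.

```python
def parse_cell_press(rows):
    """Return resource entries tagged with the most recent category header."""
    def clean(cell):
        # Treat None and whitespace-only cells as empty.
        return (cell or "").strip()

    resources = []
    category = None
    for row in rows:
        # Pad short rows so that every row unpacks into three cells.
        name, source, identifier = ([clean(c) for c in row] + ["", "", ""])[:3]
        if not name:
            continue  # blank row
        if name.upper() == "REAGENT OR RESOURCE":
            continue  # column header row
        if not source and not identifier:
            # Only the first column is filled: treat as a category header.
            # An optional title row also matches here and is simply
            # overwritten by the first real category that follows it.
            category = name
            continue
        resources.append({
            "category": category,
            "resource": name,
            "source": source,
            "identifier": identifier,
        })
    return resources


# Made-up sample table with a title row and two categories.
sample = [
    ["KEY RESOURCES TABLE", "", ""],
    ["REAGENT or RESOURCE", "SOURCE", "IDENTIFIER"],
    ["Antibodies", "", ""],
    ["Example antibody", "Vendor A", "Cat#0000"],
    ["Deposited data", None, None],
    ["Example dataset", "This paper", "N/A"],
]
parsed = parse_cell_press(sample)
```

Each entry in parsed carries the category under which it appeared, which is the information Stage 2 needs for the resource type lookup.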

Stage 2 - ASAP Format Application: Parsed resources are converted to the ASAP format using lookup tables that map Cell Press categories to ASAP resource types. The NEW/REUSE designation is derived from the source field: entries containing "This study" or "This paper" are classified as "new", and all others as "reuse". Output resources are ordered by resource type according to ASAP specifications: Dataset; Software/code; Protocol; Antibody; Bacterial strain; Viral vector; Biological sample; Chemical, peptide, or recombinant protein; Critical commercial assay; Experimental model: Cell line; Experimental model: Organism/strain; Oligonucleotide; Recombinant DNA; Other.
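
The NEW/REUSE rule and the output ordering might look as follows. This is a sketch under stated assumptions: the function names are hypothetical, the phrase match is assumed to be case-insensitive, and resource types missing from the list are assumed to sort last. The text above does not specify either behavior.

```python
# Output order as listed above; the chemical type is spelled as in the
# mapping examples ("Chemical, peptide, or recombinant protein").
ASAP_ORDER = [
    "Dataset",
    "Software/code",
    "Protocol",
    "Antibody",
    "Bacterial strain",
    "Viral vector",
    "Biological sample",
    "Chemical, peptide, or recombinant protein",
    "Critical commercial assay",
    "Experimental model: Cell line",
    "Experimental model: Organism/strain",
    "Oligonucleotide",
    "Recombinant DNA",
    "Other",
]


def new_or_reuse(source):
    """Classify a source string as "new" or "reuse" (case-insensitive: assumption)."""
    text = (source or "").lower()
    return "new" if "this study" in text or "this paper" in text else "reuse"


def sort_by_asap_order(resources):
    """Sort rows by resource type; unknown types go last (assumption)."""
    rank = {rtype: i for i, rtype in enumerate(ASAP_ORDER)}
    # sorted() is stable, so rows of the same type keep their input order.
    return sorted(resources, key=lambda r: rank.get(r["RESOURCE TYPE"], len(ASAP_ORDER)))
```

Because the sort is stable, resources within one type stay in the order in which they appeared in the Cell Press table.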

ASAP to Cell Press Conversion

Stage 1 - Resource Categorization: The ASAP table structure is reorganized by grouping resources according to RESOURCE TYPE field values. Each ASAP resource type is mapped to the corresponding Cell Press category using predefined mapping tables. Resource metadata (source, identifier, additional information) is preserved during the grouping process for subsequent integration into the Cell Press format.
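
A minimal sketch of the grouping step, assuming each ASAP row is a dictionary keyed by column name. The mapping shown is deliberately partial and illustrative, and the fallback to "Other" for unmapped types is an assumption rather than documented behavior.

```python
from collections import defaultdict

# Partial, illustrative mapping; the converter's real table is not shown
# in this document.
ASAP_TO_CELL_PRESS = {
    "Dataset": "Deposited data",
    "Antibody": "Antibodies",
    "Experimental model: Cell line": "Experimental models: Cell lines",
}


def group_by_category(asap_rows):
    """Group ASAP rows under their Cell Press category, keeping all metadata."""
    grouped = defaultdict(list)
    for row in asap_rows:
        # Unmapped types fall back to "Other" (assumption).
        category = ASAP_TO_CELL_PRESS.get(row["RESOURCE TYPE"], "Other")
        grouped[category].append(row)  # the whole row is kept, not just the name
    return dict(grouped)
```

Appending the whole row keeps source, identifier, and additional information available for Stage 2.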

Stage 2 - Cell Press Structure Construction: Grouped resources are assembled into the Cell Press hierarchical format. Category headers are inserted according to standard Cell Press ordering, with associated resources listed beneath each header. The output follows the three-column structure: "REAGENT or RESOURCE", "SOURCE", "IDENTIFIER". Category header rows contain data only in the first column, conforming to Cell Press formatting conventions.
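
The construction step can be sketched as below. The input is assumed to be a dictionary from Cell Press category to a list of resource rows. The function name and the "RESOURCE NAME" key are hypothetical, and the category order is passed in by the caller because the standard ordering is not reproduced in this document.

```python
CELL_PRESS_HEADER = ["REAGENT or RESOURCE", "SOURCE", "IDENTIFIER"]


def build_cell_press_rows(grouped, category_order):
    """Return three-column rows: header, then each category and its resources."""
    rows = [list(CELL_PRESS_HEADER)]
    for category in category_order:
        items = grouped.get(category)
        if not items:
            continue  # skip categories that have no resources
        # Category header row: data in the first column only.
        rows.append([category, "", ""])
        for item in items:
            rows.append([
                item.get("RESOURCE NAME", ""),
                item.get("SOURCE", ""),
                item.get("IDENTIFIER", ""),
            ])
    return rows
```

Categories without resources are omitted here. Whether the converter emits empty category headers is not stated above, so this is one possible choice.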

Mapping and Data Processing

Resource type mapping uses predefined dictionaries to handle format-specific terminology. Examples include "cell lines" to "Experimental model: Cell line", "chemicals" to "Chemical, peptide, or recombinant protein", and "deposited data" to "Dataset". Non-standard terminology is processed using fuzzy string matching based on character-level similarity and keyword analysis. Data integrity is maintained through validation checks at each processing stage.
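
One way to combine dictionary lookup with character-level fuzzy matching is shown below. The document does not name the matching library, so the use of the standard-library difflib module, the 0.8 cutoff, and the function name map_category are assumptions. Keyword analysis is omitted from this sketch.

```python
import difflib

# Only the three examples given above; the full dictionary is not shown
# in this document.
CATEGORY_TO_ASAP = {
    "cell lines": "Experimental model: Cell line",
    "chemicals": "Chemical, peptide, or recombinant protein",
    "deposited data": "Dataset",
}


def map_category(label, cutoff=0.8):
    """Map a category label to an ASAP resource type, or None if no match."""
    # Normalize case and collapse runs of whitespace before the lookup.
    key = " ".join(label.lower().split())
    if key in CATEGORY_TO_ASAP:
        return CATEGORY_TO_ASAP[key]
    # Fall back to the closest known label by character-level similarity.
    close = difflib.get_close_matches(key, list(CATEGORY_TO_ASAP), n=1, cutoff=cutoff)
    return CATEGORY_TO_ASAP[close[0]] if close else None
```

Returning None rather than guessing leaves unmatched labels available for flagging and manual resolution.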

Quality Control

Each conversion includes format compliance verification, completeness validation, and mapping accuracy assessment. Processing logs record all transformation decisions and identify resources that cannot be automatically categorized. Error handling mechanisms flag unmappable resources and provide specific guidance for manual resolution. Output validation confirms conformance to target format specifications.
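
A completeness and compliance check of the kind described above could be sketched as follows. The set of required fields, the logger name, and the function name are all assumptions made for illustration. The set of allowed resource types is supplied by the caller.

```python
import logging

logger = logging.getLogger("krt_converter")  # hypothetical logger name

# Assumed required columns; adjust to the target format specification.
REQUIRED_FIELDS = ("RESOURCE TYPE", "RESOURCE NAME", "SOURCE", "NEW/REUSE")


def check_asap_rows(rows, allowed_types):
    """Return (row index, message) pairs for every problem found."""
    issues = []
    for index, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            value = row.get(field) or ""
            if not str(value).strip():
                issues.append((index, f"missing {field}"))
        rtype = row.get("RESOURCE TYPE") or ""
        if rtype and rtype not in allowed_types:
            issues.append((index, f"unmapped resource type: {rtype!r}"))
    # Log every issue so that flagged rows can be resolved manually.
    for index, message in issues:
        logger.warning("row %d: %s", index, message)
    return issues
```

An empty result means that no problems were detected by these particular checks, not that the output is guaranteed to conform to the target format.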

Implementation Details

Data processing uses pandas DataFrame operations with UTF-8 encoding support. Preprocessing removes null-value rows and whitespace-only entries. CSV and Excel (.xlsx) input formats are supported. Encodings are tried in the following priority: UTF-8, Latin-1, CP1252, ISO-8859-1. Latin-1 and ISO-8859-1 are two names for the same encoding, and Latin-1 can decode any byte sequence, so the CP1252 and ISO-8859-1 fallbacks are not reached in practice. Processing steps are logged for reproducibility and debugging purposes.
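
The encoding fallback and the empty-row preprocessing can be sketched with the standard library alone. The function names are hypothetical, and the converter itself is described as using pandas, so this only illustrates the logic.

```python
# Priority order as listed above. In Python, "latin-1" and "iso-8859-1"
# are aliases of the same codec, and it never raises on decode, so the
# entries after it are kept only to mirror the documented list.
ENCODINGS = ["utf-8", "latin-1", "cp1252", "iso-8859-1"]


def decode_with_fallback(raw, encodings=ENCODINGS):
    """Decode bytes with the first encoding that succeeds; return (text, encoding)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


def drop_empty_rows(rows):
    """Remove rows whose cells are all None or whitespace-only."""
    return [
        row for row in rows
        if any((cell or "").strip() for cell in row)
    ]
```

The decoded text can then be wrapped in io.StringIO and passed to pandas.read_csv, or the chosen encoding can be passed to pandas.read_csv through its encoding parameter.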