chromtrace: Raw chromatogram trace file parser

This module handles the parsing of raw traces present in chromatography files, whether the source is a liquid chromatograph (LC) or a gas chromatograph (GC). The basic function of the parser is to:

  1. read in the raw data and create timestamped traces

  2. collect metadata such as the method information, sample ID, etc.

chromtrace loads the chromatographic data from the specified file, determines the uncertainties of the signal (y-axis), and explicitly populates the points in the time axis (x-axis), when required.

Usage

Available since yadg-4.0. The parser supports the following parameters:

pydantic model dgbowl_schemas.yadg.dataschema_4_2.step.ChromTrace.Params

{
   "title": "Params",
   "type": "object",
   "properties": {
      "filetype": {
         "title": "Filetype",
         "default": "ezchrom.asc",
         "enum": [
            "ezchrom.asc",
            "fusion.json",
            "fusion.zip",
            "agilent.ch",
            "agilent.dx",
            "agilent.csv"
         ],
         "type": "string"
      },
      "calfile": {
         "title": "Calfile",
         "deprecated": true,
         "type": "string"
      },
      "species": {
         "title": "Species",
         "deprecated": true
      },
      "detectors": {
         "title": "Detectors",
         "deprecated": true
      }
   },
   "additionalProperties": false
}

field filetype: Literal['ezchrom.asc', 'fusion.json', 'fusion.zip', 'agilent.ch', 'agilent.dx', 'agilent.csv'] = 'ezchrom.asc'
field calfile: Optional[str] = None

Species calibration specification.

DEPRECATED in DataSchema-4.2

This feature is deprecated as of yadg-4.2 and will stop working in yadg-5.0.

field species: Optional[Any] = None

Species information as a dict.

DEPRECATED in DataSchema-4.2

This feature is deprecated as of yadg-4.2 and will stop working in yadg-5.0.

field detectors: Optional[Any] = None

Detector integration parameters as a dict.

DEPRECATED in DataSchema-4.2

This feature is deprecated as of yadg-4.2 and will stop working in yadg-5.0.

DEPRECATED in yadg-4.2

The calfile, detectors and species parameters are deprecated as of yadg-4.2 and will stop working in yadg-5.0.

Formats

The filetypes currently supported by the parser are:

  • EZ-Chrom ASCII export (ezchrom.asc): see ezchromasc

  • Agilent Chemstation Chromtab (agilent.csv): see agilentcsv

  • Agilent OpenLab binary signal (agilent.ch): see agilentch

  • Agilent OpenLab data archive (agilent.dx): see agilentdx

  • Inficon Fusion JSON format (fusion.json): see fusionjson

  • Inficon Fusion zip archive (fusion.zip): see fusionzip

Provides

The raw data is stored, for each timestep, using the following format:

- uts: !!float
  fn:  !!str
  raw:
    traces:
      "{{ trace_name }}":        # detector name from the raw data file
        id:               !!int  # detector id for matching with calibration data
        t:                       # time-axis units are always seconds
          {n: [!!float, ...], s: [!!float, ...], u: "s"}
        y:                       # y-axis units are determined from raw file
          {n: [!!float, ...], s: [!!float, ...], u: !!str}
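As an illustration of this layout, the sketch below builds one timestep by hand and reads a trace back out of it. It does not call yadg. The detector name "FID", the file name, and all numbers are invented, and n, s and u are assumed to hold the nominal values, their uncertainties, and the unit.

```python
# Hypothetical timestep following the layout above; all values are invented.
timestep = {
    "uts": 1650000000.0,
    "fn": "example.ch",
    "raw": {
        "traces": {
            "FID": {
                "id": 0,
                "t": {"n": [0.0, 0.5, 1.0], "s": [0.1, 0.1, 0.1], "u": "s"},
                "y": {"n": [1.0, 4.0, 2.0], "s": [0.5, 0.5, 0.5], "u": "pA"},
            }
        }
    },
}


def trace_points(step, name):
    """Return (time, signal, signal uncertainty) triples for one trace."""
    trace = step["raw"]["traces"][name]
    return list(zip(trace["t"]["n"], trace["y"]["n"], trace["y"]["s"]))


def trace_maximum(step, name):
    """Return the (time, signal) pair at which the signal is largest."""
    t, y, _ = max(trace_points(step, name), key=lambda point: point[1])
    return t, y


for name in timestep["raw"]["traces"]:
    print(name, trace_maximum(timestep, name))
```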

Note

To parse processed data in the raw data files, such as integrated peak areas or concentrations, use the chromdata parser instead.

DEPRECATED in yadg-4.2

The below functionality has been deprecated in yadg-4.2 and will stop working in yadg-5.0.

The data processing performed by chromtrace is enabled automatically when calibration information is provided. The resulting data is stored in the derived entry in each timestep, and contains the following information:

- derived:
    peaks:
      "{{ trace_name }}":     # detector name from raw data file
        "{{ species_name }}": # species name matched from calibration
          peak:
            max:      !!int   # index of peak maximum
            llim:     !!int   # index of peak left limit
            rlim:     !!int   # index of peak right limit
          A:                  # integrated peak area
            {n: !!float, s: !!float, u: !!str}
          h:                  # height of peak maximum
            {n: !!float, s: !!float, u: !!str}
          c:                  # calibrated "concentration" or other quantity
            {n: !!float, s: !!float, u: !!str}
    height:                   # baseline-corrected height of peak maximum
      "{{ species_name }}":
          {n: !!float, s: !!float, u: !!str}
    area:                     # integrated area of the sample peak
      "{{ species_name }}":
          {n: !!float, s: !!float, u: !!str}
    concentration:            # concentration of species derived from area
      "{{ species_name }}":
          {n: !!float, s: !!float, u: !!str}
    xout:                     # normalised mol fractions of species
      "{{ species_name }}":
          {n: !!float, s: !!float, u: !!str}

Note

The specification of dictionaries that ought to be passed to species and detectors (or stored as json in "calfile") is described in yadg.parsers.chromtrace.main.parse_detector_spec().

Note

The quantity c, determined for each integrated peak, may not necessarily be concentration. It can also be mole fraction, as it is determined from the peak area in A and any provided calibration specification. The calibration interface allows for units to be supplied.

Note

The mol fractions in xout always sum up to unity. If there is more than one outlet stream, these mol fractions have to be weighted by the flow rate in a post-processing routine.
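A minimal sketch of the normalisation and of a possible flow-weighting step follows. Both helper functions are hypothetical and not part of yadg; uncertainties are ignored, and the flow is assumed to be a total molar flow of the stream in arbitrary units.

```python
def normalise(concentrations):
    """Scale a {species: value} mapping so that the values sum to unity."""
    total = sum(concentrations.values())
    if total <= 0:
        raise ValueError("sum of concentrations must be positive")
    return {name: value / total for name, value in concentrations.items()}


def species_flows(xout, total_flow):
    """Weight normalised mol fractions by the total flow of one stream."""
    return {name: x * total_flow for name, x in xout.items()}


# Invented numbers for a single outlet stream.
xout = normalise({"CO": 1.0, "H2": 3.0})
print(xout)
print(species_flows(xout, total_flow=8.0))
```

With several outlet streams, the per-stream results of species_flows() could then be summed species by species in the post-processing routine.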

Submodules

agilentch: Processing Agilent OpenLab binary signal trace files (CH and IT).

Currently supports version “179” of the files. Version information is defined in the magic_values (parameters & metadata) and data_dtypes (data) dictionaries.

Adapted from ImportAgilent.m and aston.

Exposed metadata:

params:
  method:   !!str
  sampleid: !!str
  username: !!str
  version:  !!str
  datafile: None

File Structure of .ch files

0x0000 "version magic"
0x0108 "data offset"
0x011a "x-axis minimum (ms)"
0x011e "x-axis maximum (ms)"
0x035a "sample ID"
0x0559 "description"
0x0758 "username"
0x0957 "timestamp"
0x09bc "inlet"
0x09e5 "instrument name"
0x0a0e "method"
0x104c "y-axis unit"
0x1075 "detector name"
0x1274 "y-axis intercept"
0x127c "y-axis slope"

Data is stored as a consecutive sequence of <f8 values, starting at the offset (calculated as offset = ("data offset" - 1) * 512) and continuing until the end of the file.
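The offset arithmetic can be sketched with the standard library alone. The function below is a hypothetical helper, not the parser itself, and it assumes that the "data offset" field is a big-endian 4-byte integer, which the layout above does not state. The demonstration buffer is synthetic.

```python
import struct

DATA_OFFSET_POS = 0x0108  # position of the "data offset" field, from the table above


def read_ch_values(buf: bytes) -> list[float]:
    """Decode the <f8 data block of a .ch file held in memory."""
    # ASSUMPTION: the "data offset" field is a big-endian 4-byte integer;
    # the layout above does not state its type.
    (block,) = struct.unpack_from(">i", buf, DATA_OFFSET_POS)
    start = (block - 1) * 512
    count = (len(buf) - start) // 8  # number of 8-byte floats until end of file
    return list(struct.unpack_from(f"<{count}d", buf, start))


# Synthetic buffer for demonstration only, not a real instrument file.
header = bytearray(512)
struct.pack_into(">i", header, DATA_OFFSET_POS, 2)  # data starts at (2 - 1) * 512
demo = bytes(header) + struct.pack("<3d", 1.0, 2.5, -4.0)
print(read_ch_values(demo))
```

Converting the decoded values to physical units would additionally involve the "y-axis slope" and "y-axis intercept" fields, which this sketch leaves untouched.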

Code author: Peter Kraus <peter.kraus@empa.ch>

yadg.parsers.chromtrace.agilentch.process(fn, encoding, timezone)

Agilent OpenLAB signal trace parser

One chromatogram per file with a single trace. Binary data format.

Parameters
  • fn (str) – Filename to process.

  • encoding (str) – Not used as the file is binary.

  • timezone (str) – Timezone information. This should be "localtime".

Returns

([chrom], metadata) – Standard timesteps & metadata tuple.

Return type

tuple[list, dict]

agilentcsv: Processing Agilent Chemstation Chromtab tabulated data files (csv).

This file format may include more than one timestep in each CSV file. It contains a header section for each timestep, followed by a detector name, and a sequence of [X, Y] datapoints.

Exposed metadata:

params:
  method:   None
  sampleid: !!str
  username: None
  version:  None
  datafile: !!str

Unfortunately, neither the method nor the version is exposed, which is a significant weakness of this file format.

Code author: Peter Kraus <peter.kraus@empa.ch>

yadg.parsers.chromtrace.agilentcsv.process(fn, encoding, timezone)

Agilent Chemstation CSV (Chromtab) file parser

Each file may contain multiple chromatograms, each with multiple traces. Each chromatogram starts with a header section, followed by its traces; every trace consists of a header line and x,y-data.

Parameters
  • fn (str) – Filename to process.

  • encoding (str) – Encoding used to open the file.

  • timezone (str) – Timezone information. This should be "localtime".

Returns

(chroms, metadata) – Standard timesteps & metadata tuple.

Return type

tuple[list, dict]

agilentdx: Processing Agilent OpenLab data archive files (DX).

This is a wrapper parser which unzips the provided DX file, and then uses the yadg.parsers.chromtrace.agilentch parser to parse every CH file present in the archive. The IT files in the archive are currently ignored.

Exposed metadata:

params:
  method:   !!str
  sampleid: !!str
  username: !!str
  version:  !!str
  datafile: !!str

In addition to the metadata exposed by the CH parser, the datafile entry is populated with the corresponding name of the CH file. The fn entry in each timestep contains the parent DX file.

Note

Currently the timesteps from multiple CH files (if present) are appended in the timesteps array without any further sorting.

Code author: Peter Kraus

yadg.parsers.chromtrace.agilentdx.process(fn, encoding, timezone)

Agilent OpenLab DX archive parser.

This is a simple wrapper around the Agilent OpenLab signal trace parser in yadg.parsers.chromtrace.agilentch. This wrapper first unzips the DX file into a temporary directory, and then processes all CH files found within the archive, concatenating timesteps from multiple files.

Parameters
  • fn (str) – Filename to process.

  • encoding (str) – Not used as the file is binary.

  • timezone (str) – Timezone information. This should be "localtime".

Returns

(chroms, metadata) – Standard timesteps & metadata tuple.

Return type

tuple[list, dict]
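The unzip-then-parse flow of the DX wrapper can be sketched as follows. This is not the actual implementation: process_archive() and its parse_one argument are hypothetical names, with parse_one standing in for the CH parser, and the demonstration archive is synthetic.

```python
import os
import tempfile
import zipfile


def process_archive(fn, parse_one):
    """Unzip fn into a temporary directory and parse every .ch file in it.

    parse_one is a hypothetical callable that takes a file path and returns
    a list of timestep dictionaries; it stands in for the CH parser.
    """
    timesteps = []
    datafiles = []
    with tempfile.TemporaryDirectory() as tmpdir:
        with zipfile.ZipFile(fn) as archive:
            archive.extractall(tmpdir)
        for root, _dirs, files in os.walk(tmpdir):
            for name in files:
                if not name.lower().endswith(".ch"):
                    continue  # other members, such as .it files, are skipped
                datafiles.append(name)
                for step in parse_one(os.path.join(root, name)):
                    # each timestep points back at the parent archive
                    timesteps.append({**step, "fn": fn})
    return timesteps, datafiles


# Demonstration with a synthetic archive and a dummy parser.
workdir = tempfile.mkdtemp()
demo_dx = os.path.join(workdir, "demo.dx")
with zipfile.ZipFile(demo_dx, "w") as archive:
    archive.writestr("signal1.ch", b"placeholder")
    archive.writestr("trace1.it", b"placeholder")

steps, names = process_archive(demo_dx, lambda path: [{"uts": 0.0}])
print(len(steps), names)
```

As in the wrapper described above, timesteps are appended in the order in which the files are encountered, without any sorting by timestamp.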

ezchromasc: Processing EZ-Chrom ASCII export files (dat.asc).

This file format includes one timestep with multiple traces in each ASCII file. It contains a header section, and a sequence of Y datapoints for each detector. The X axis is uniform between traces, and its units have to be deduced from the header.

Exposed metadata:

params:
  method:   !!str
  sampleid: !!str
  username: !!str
  version:  !!str
  datafile: !!str

Code author: Peter Kraus <peter.kraus@empa.ch>

yadg.parsers.chromtrace.ezchromasc.process(fn, encoding, timezone)

EZ-Chrom ASCII export file parser.

One chromatogram per file with multiple traces. A header section is followed by y-values for each trace. The x-values have to be deduced using the number of points, the frequency, and the x-multiplier. The method name is available, but detector names are not; detectors are instead labelled by their numerical index in the file.

Parameters
  • fn (str) – Filename to process.

  • encoding (str) – Encoding used to open the file.

  • timezone (str) – Timezone information. This should be "localtime".

Returns

([chrom], metadata) – Standard timesteps & metadata tuple.

Return type

tuple[list, dict]
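How an x-axis might be rebuilt from the number of points, the frequency, and the x-multiplier is sketched below. The function name is hypothetical, and the formula x_i = i * x_multiplier / frequency is an assumption about uniform sampling; the exact convention used by the parser, including the unit conversion, is not spelled out here.

```python
def uniform_time_axis(npoints, frequency, x_multiplier=1.0):
    """Return npoints evenly spaced x-values starting at zero.

    ASSUMPTION: points are sampled at a constant frequency and scaled by
    x_multiplier, so that x_i = i * x_multiplier / frequency.
    """
    if frequency <= 0:
        raise ValueError("sampling frequency must be positive")
    step = x_multiplier / frequency
    return [i * step for i in range(npoints)]


# Invented header values: 5 points sampled at 4 Hz.
print(uniform_time_axis(5, 4.0))
```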

fusionjson: Processing Inficon Fusion json data format (json).

This is a fairly detailed data format, including the traces, the calibration applied, and also the integrated peak areas. If the peak areas are present, they are returned in the list of timesteps as a "peaks" entry.

Note

The detectors in the trace data are not necessarily in a consistent order, which may change between different files. Hence, the keys are sorted.

Exposed metadata:

params:
  method:   !!str
  sampleid: !!str
  username: None
  version:  !!str
  datafile: !!str

Code author: Peter Kraus

yadg.parsers.chromtrace.fusionjson.process(fn, encoding, timezone)

Fusion json format.

One chromatogram per file with multiple traces, and integrated peak areas.

Warning

To parse the integrated data present in these files, use the chromdata parser.

Only a subset of the metadata is retained, including the method name, detector names, and information about assigned peaks.

Parameters
  • fn (str) – Filename to process.

  • encoding (str) – Encoding used to open the file.

  • timezone (str) – Timezone information. This should be "localtime".

Returns

([chrom], metadata) – Standard timesteps & metadata tuple.

Return type

tuple[list, dict]

fusionzip: Processing Inficon Fusion zipped data format (zip).

This is a wrapper parser which unzips the provided zip file, and then uses the yadg.parsers.chromtrace.fusionjson parser to parse every data file present in the archive.

Exposed metadata:

params:
  method:   !!str
  sampleid: !!str
  username: None
  version:  !!str
  datafile: !!str

Code author: Peter Kraus

yadg.parsers.chromtrace.fusionzip.process(fn, encoding, timezone)

Fusion zip file format.

Fusion GCs can export their json files as a zip archive of a folder of jsons. This parser allows for parsing of this zip archive directly, without the user having to unzip & move the data.

Parameters
  • fn (str) – Filename to process.

  • encoding (str) – Not used as the file is binary.

  • timezone (str) – Timezone information. This should be "localtime".

Returns

(chroms, metadata) – Standard timesteps & metadata tuple.

Return type

tuple[list, dict]

integration: Routines for chromatogram integration.

This module contains the integrate_trace() function, as well as several helper functions to smooth, peak-pick, determine edges, and integrate the supplied traces.

Smoothing

Smoothing can optionally be performed on the Y-values of each trace, using a Savitzky-Golay filter. The default smoothing is performed using a cubic fit to a window length of 7; if the polyorder or the window length is not specified, smoothing is not used.
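For orientation, a 7-point cubic Savitzky-Golay filter can be written as a convolution with the fixed weights (-2, 3, 6, 7, 6, 3, -2) / 21. The pure-Python sketch below is illustrative only and is not the routine used by this module; in particular, copying the edge points unchanged is a simplification made for this example.

```python
# Savitzky-Golay convolution weights for a cubic fit over 7 points.
SG7_CUBIC = [-2, 3, 6, 7, 6, 3, -2]
SG7_NORM = 21


def smooth_sg7(ys):
    """Smooth ys with a 7-point cubic Savitzky-Golay filter.

    Simplification: the first and last three points are copied unchanged,
    which is not necessarily how the module treats the edges.
    """
    half = len(SG7_CUBIC) // 2
    out = list(ys)
    for i in range(half, len(ys) - half):
        window = ys[i - half : i + half + 1]
        out[i] = sum(w * y for w, y in zip(SG7_CUBIC, window)) / SG7_NORM
    return out


# Invented noisy trace with a single peak.
noisy = [0.0, 0.2, -0.1, 1.1, 3.9, 9.2, 3.8, 1.0, 0.1, -0.2, 0.1]
print(smooth_sg7(noisy))
```

A useful property for checking such a filter is that any polynomial of degree three or lower passes through it unchanged at the interior points.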

Peak-picking and edge-finding

Peak-picking is performed on the smoothed Y-data to find peaks, as well as on the mirror image of the data to find bands. Only peaks are further processed. Additionally, the 1st and 2nd derivatives of the Y-data are evaluated, and the zero-points are found using numpy routines.

The peak edges are taken as either the nearest minima adjacent to the peak maximum, or as the inflection points at which the gradient falls below a prescribed threshold, whichever is closer to the peak maximum.
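The "whichever comes first" logic of the edge search can be illustrated with a deliberately simplified walk outwards from the peak maximum. The function below is hypothetical: it uses per-sample differences instead of proper derivatives and is not the algorithm implemented in this module.

```python
def find_edge(ys, peak, step, threshold):
    """Walk from index peak in direction step (-1 or +1); return the edge index.

    The walk stops at the first local minimum, or once the per-sample drop
    has passed its steepest value and fallen below threshold, whichever
    comes first.
    """
    i = peak
    steepest = 0.0
    while 0 <= i + step < len(ys):
        drop = ys[i] - ys[i + step]
        if drop <= 0:
            return i  # signal no longer decreases: local minimum
        if drop < steepest and drop < threshold:
            return i  # beyond the steepest point and flatter than threshold
        steepest = max(steepest, drop)
        i += step
    return i  # ran into the end of the trace


# Invented trace with one peak at index 4.
ys = [0.0, 0.1, 1.0, 4.0, 9.0, 4.0, 1.0, 0.1, 0.0, 0.3]
peak = ys.index(max(ys))
llim = find_edge(ys, peak, -1, threshold=0.5)
rlim = find_edge(ys, peak, +1, threshold=0.5)
print(peak, llim, rlim)
```

Lowering the threshold lets the walk continue until it reaches a local minimum or the end of the trace, so the threshold criterion only matters when it is met closer to the maximum.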

Baseline correction

Using the determined peak edges, the baseline is linearly interpolated in sections of Y-data which belong to a peak. The interpolation is performed using the raw (not smoothed) Y-data.

If multiple peaks are adjacent to each other without a gap, the interpolation begins at the left limit of the leftmost peak and continues uninterrupted to the right limit of the rightmost peak. The points which belong to the interpolated areas are assumed to have an uncertainty of zero.

The corrected baseline is then obtained by subtracting the interpolated baseline from the original raw (not smoothed) data.
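A minimal sketch of the baseline subtraction for a single peak follows. The function name is hypothetical, uncertainties are ignored, and the treatment of points outside the peak (corrected value of zero) is an assumption of this example rather than a statement about the module.

```python
def subtract_linear_baseline(xs, ys, llim, rlim):
    """Return ys with a straight baseline between llim and rlim removed.

    ASSUMPTION: outside [llim, rlim] the baseline follows the signal, so
    the corrected values there are zero. Requires llim < rlim.
    """
    x0, x1 = xs[llim], xs[rlim]
    y0, y1 = ys[llim], ys[rlim]
    corrected = [0.0] * len(ys)
    for i in range(llim, rlim + 1):
        baseline = y0 + (y1 - y0) * (xs[i] - x0) / (x1 - x0)
        corrected[i] = ys[i] - baseline
    return corrected


# Invented data: a peak at index 2 sitting on a rising baseline.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 6.0, 4.0, 5.0]
print(subtract_linear_baseline(xs, ys, llim=1, rlim=3))
```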

Peak integration

Peak integration is performed on the corrected baseline and the matching X-data using the trapezoidal method as implemented in np.trapz.
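The trapezoidal rule itself is short enough to write out. The helper below is a hypothetical stand-alone version for one peak between two index limits; it ignores uncertainties and is shown only to make the arithmetic explicit.

```python
def trapezoid_area(xs, ys, llim, rlim):
    """Integrate ys over xs between indices llim and rlim (inclusive)."""
    area = 0.0
    for i in range(llim, rlim):
        # area of the trapezoid spanned by two neighbouring points
        area += 0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
    return area


# Invented baseline-corrected data with a triangular peak.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 3.0, 0.0, 0.0]
print(trapezoid_area(xs, ys, llim=1, rlim=3))
```

Because each trapezoid uses its own width, the same function also handles time axes that are not evenly spaced.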

Code author: Peter Kraus <peter.kraus@empa.ch>

yadg.parsers.chromtrace.integration.integrate_trace(traces, chromspec)

Integration, calibration, and normalisation handling function. Used to process all chromatographic data for which a calibration has been provided.

Parameters
  • traces (dict) – A dictionary of trace data, with keys being the “raw” name of the detector, and the values containing the "id" for specification matching, and a "data" tuple containing the (xs, ys) where each element is a pair of (np.ndarray) with the nominal values and standard deviations.

  • chromspec (dict) – Parsed calibration information, with keys being the detector names in the calibration file, and values containing the "id" for detector matching, "peakdetect" dictionary with peak-picking and edge-finding settings, and "species" dictionary with names of species as keys and the left, right limits and calibration information as values.

Returns

derived – A dictionary containing the derived data, including the peak heights, areas, concentrations, mol fractions, and a dictionary with the peak picking information (name, maximum, limits, height, area).

Return type

dict[dict]

yadg.parsers.chromtrace.main.parse_detector_spec(calfile=None, detectors=None, species=None)

Chromatography detector parser.

Combines the specification provided in calfile with that provided in detectors and species.

The format of calfile is as follows:

"{{ detector_name }}":    # name of the detector
  id:           !!int     # ID of the detector used for matching
  prefer:       !!bool    # whether to prefer this detector for xout calc
  peakdetect:
    window:     !!int     # Savitzky-Golay window_length = 2*window + 1
    polyorder:  !!int     # Savitzky-Golay polyorder
    prominence: !!float   # peak picking prominence parameter
    threshold:  !!float   # peak edge detection threshold
  species:
    "{{ species_name }}": # name of the analyte
      l:        !!float   # peak picking left limit [s]
      r:        !!float   # peak picking right limit [s]
      calib:    {}        # calibration specification
      unit:     !!str     # optional unit for the concentration, by default %

Note

The syntax of the calibration specification is detailed in yadg.dgutils.calib.calib_handler().
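As a hedged example of the calfile layout, the snippet below assembles one detector entry as a Python dictionary and serialises it to json. The detector name, species name, limits and peak-detection values are invented, and calib is left as an empty placeholder because its syntax is documented elsewhere; a real calibration file needs an actual calibration specification there.

```python
import json

# Hypothetical calibration entry; all names and numbers are invented.
calspec = {
    "FID": {
        "id": 0,
        "prefer": True,
        "peakdetect": {
            "window": 3,
            "polyorder": 3,
            "prominence": 1e-4,
            "threshold": 1.0,
        },
        "species": {
            "methane": {"l": 60.0, "r": 75.0, "calib": {}, "unit": "%"},
        },
    }
}


def filter_window_length(window):
    """Window length of the smoothing filter implied by the window entry."""
    return 2 * window + 1


calfile_text = json.dumps(calspec, indent=2)
print(calfile_text)
print(filter_window_length(calspec["FID"]["peakdetect"]["window"]))
```

With window set to 3 and polyorder set to 3, the entry corresponds to a cubic fit over a window length of 7.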

The format of detectors is as follows:

"{{ detector_name }}":  # name of the detector
  id:           !!int   # ID of the detector used for matching
  prefer:       !!bool  # whether to prefer this detector for xout calc
  peakdetect:
    window:     !!int   # Savitzky-Golay window_length = 2*window + 1
    polyorder:  !!int   # Savitzky-Golay polyorder
    prominence: !!float # peak picking prominence parameter
    threshold:  !!float # peak edge detection threshold

The format of species is as follows:

"{{ detector_name }}":    # name of the detector
  species:
    "{{ species_name }}": # name of the analyte
      l:        !!float   # peak picking left limit [s]
      r:        !!float   # peak picking right limit [s]
      calib:    !!calib   # calibration specification

Note

The syntax of the calibration specification is detailed in yadg.dgutils.calib.calib_handler().

Parameters
  • calfile (Optional[str]) – A json file containing the calibration data in the format prescribed above.

  • detectors (Optional[dict]) – A dictionary containing the "id", "peakdetect" and "prefer" keys for each detector, in the format shown above.

  • species (Optional[dict]) – A dictionary containing the species names as keys and their specification as dictionaries, in the format shown above.

Returns

calib – The combined calibration specification.

Return type

dict

yadg.parsers.chromtrace.main.process(fn, encoding='utf-8', timezone='localtime', parameters=None)

Unified raw chromatogram parser.

This parser processes GC and LC chromatograms in signal(time) format. When provided with a calibration file, this tool will integrate the trace, and provide the peak areas, retention times, and concentrations of the detected species.

Parameters
  • fn (str) – The file containing the trace(s) to parse.

  • encoding (str) – Encoding of fn, by default “utf-8”.

  • timezone (str) – A string description of the timezone. Default is “localtime”.

  • parameters (Optional[BaseModel]) – Parameters for ChromTrace.

Returns

(data, metadata, fulldate) – Tuple containing the timesteps, metadata, and full date tag. All currently supported file formats return full date.

Return type

tuple[list, dict, bool]