chromtrace: Raw chromatogram trace file parser
This module handles the parsing of raw traces present in chromatography files, whether the source is a liquid chromatograph (LC) or a gas chromatograph (GC). The basic function of the parser is to:
- read in the raw data and create timestamped traces
- collect metadata such as the method information, sample ID, etc.
chromtrace loads the chromatographic data from the specified file, determines the uncertainties of the signal (y-axis), and explicitly populates the points in the time axis (x-axis), when required.
Usage
Available since yadg-4.0. The parser supports the following parameters:
- pydantic model dgbowl_schemas.yadg.dataschema_4_2.step.ChromTrace.Params
JSON schema:
{ "title": "Params", "type": "object", "properties": { "filetype": { "title": "Filetype", "default": "ezchrom.asc", "enum": [ "ezchrom.asc", "fusion.json", "fusion.zip", "agilent.ch", "agilent.dx", "agilent.csv" ], "type": "string" }, "calfile": { "title": "Calfile", "deprecated": true, "type": "string" }, "species": { "title": "Species", "deprecated": true }, "detectors": { "title": "Detectors", "deprecated": true } }, "additionalProperties": false }
- field filetype: Literal['ezchrom.asc', 'fusion.json', 'fusion.zip', 'agilent.ch', 'agilent.dx', 'agilent.csv'] = 'ezchrom.asc'
- field calfile: Optional[str] = None
Species calibration specification.
DEPRECATED in DataSchema-4.2: This feature is deprecated as of yadg-4.2 and will stop working in yadg-5.0.
- field species: Optional[Any] = None
Species information as a dict.
DEPRECATED in DataSchema-4.2: This feature is deprecated as of yadg-4.2 and will stop working in yadg-5.0.
- field detectors: Optional[Any] = None
Detector integration parameters as a dict.
DEPRECATED in DataSchema-4.2: This feature is deprecated as of yadg-4.2 and will stop working in yadg-5.0.
DEPRECATED in yadg-4.2
The calfile, detectors and species parameters are deprecated as of yadg-4.2 and will stop working in yadg-5.0.
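As a rough illustration of the schema above, the following standard-library sketch checks a plain parameter dictionary against the same constraints (allowed filetype values, default filetype, no additional keys). The name check_params is hypothetical and is not part of yadg or dgbowl_schemas; the actual validation is carried out by the pydantic model.

```python
# Hypothetical helper mirroring the JSON schema above; not part of yadg.
ALLOWED_FILETYPES = (
    "ezchrom.asc", "fusion.json", "fusion.zip",
    "agilent.ch", "agilent.dx", "agilent.csv",
)
DEPRECATED_KEYS = ("calfile", "species", "detectors")


def check_params(params: dict) -> dict:
    """Return a copy of params with the default filetype filled in."""
    known = {"filetype", *DEPRECATED_KEYS}
    extra = set(params) - known
    if extra:
        # corresponds to "additionalProperties": false in the schema
        raise ValueError(f"unknown parameters: {sorted(extra)}")
    out = {"filetype": "ezchrom.asc", **params}
    if out["filetype"] not in ALLOWED_FILETYPES:
        raise ValueError(f"unsupported filetype: {out['filetype']!r}")
    return out
```

Calling check_params({"filetype": "agilent.dx"}) returns the dictionary unchanged, while an unknown filetype or an unexpected key raises a ValueError.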
Formats
The filetypes currently supported by the parser are:
- EZ-Chrom ASCII export (ezchrom.asc): see ezchromasc
- Agilent Chemstation Chromtab (agilent.csv): see agilentcsv
- Agilent OpenLab binary signal (agilent.ch): see agilentch
- Agilent OpenLab data archive (agilent.dx): see agilentdx
- Inficon Fusion JSON format (fusion.json): see fusionjson
- Inficon Fusion zip archive (fusion.zip): see fusionzip
Provides
The raw data is stored, for each timestep, using the following format:
- uts: !!float
  fn:  !!str
  raw:
    traces:
      "{{ trace_name }}":  # detector name from the raw data file
        id: !!int          # detector id for matching with calibration data
        t:                 # time-axis units are always seconds
          {n: [!!float, ...], s: [!!float, ...], u: "s"}
        y:                 # y-axis units are determined from raw file
          {n: [!!float, ...], s: [!!float, ...], u: !!str}
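To make the layout above concrete, here is a minimal sketch that pulls the nominal values (n), uncertainties (s) and unit (u) of one trace out of a timestep. The timestep contents (detector name "FID", unit "pA", all numbers) are made up for illustration, and trace_arrays is a hypothetical helper rather than a yadg function.

```python
import numpy as np

# Illustrative timestep following the layout above; all values are made up.
timestep = {
    "uts": 1650000000.0,
    "fn": "example.ch",
    "raw": {
        "traces": {
            "FID": {
                "id": 0,
                "t": {"n": [0.0, 0.5, 1.0], "s": [0.01, 0.01, 0.01], "u": "s"},
                "y": {"n": [1.0, 4.0, 2.0], "s": [0.1, 0.1, 0.1], "u": "pA"},
            }
        }
    },
}


def trace_arrays(step: dict, name: str):
    """Return (t, t_err, y, y_err, y_unit) for one named trace."""
    trace = step["raw"]["traces"][name]
    t = np.asarray(trace["t"]["n"], dtype=float)
    t_err = np.asarray(trace["t"]["s"], dtype=float)
    y = np.asarray(trace["y"]["n"], dtype=float)
    y_err = np.asarray(trace["y"]["s"], dtype=float)
    return t, t_err, y, y_err, trace["y"]["u"]


t, t_err, y, y_err, unit = trace_arrays(timestep, "FID")
```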
Note
To parse processed data in the raw data files, such as integrated peak areas or concentrations, use the chromdata parser instead.
DEPRECATED in yadg-4.2
The functionality described below has been deprecated in yadg-4.2 and will stop working in yadg-5.0.
The data processing performed by chromtrace is enabled automatically when calibration information is provided. The resulting data is stored in the derived entry in each timestep, and contains the following information:
- derived:
    peaks:
      "{{ trace_name }}":      # detector name from raw data file
        "{{ species_name }}":  # species name matched from calibration
          peak:
            max:  !!int        # index of peak maximum
            llim: !!int        # index of peak left limit
            rlim: !!int        # index of peak right limit
          A:                   # integrated peak area
            {n: !!float, s: !!float, u: !!str}
          h:                   # height of peak maximum
            {n: !!float, s: !!float, u: !!str}
          c:                   # calibrated "concentration" or other quantity
            {n: !!float, s: !!float, u: !!str}
    height:                    # baseline-corrected height of peak maximum
      "{{ species_name }}":
        {n: !!float, s: !!float, u: !!str}
    area:                      # integrated area of the sample peak
      "{{ species_name }}":
        {n: !!float, s: !!float, u: !!str}
    concentration:             # concentration of species derived from area
      "{{ species_name }}":
        {n: !!float, s: !!float, u: !!str}
    xout:                      # normalised mol fractions of species
      "{{ species_name }}":
        {n: !!float, s: !!float, u: !!str}
Note
The specification of dictionaries that ought to be passed to species and detectors (or stored as json in "calfile") is described in yadg.parsers.chromtrace.main.parse_detector_spec().
Note
The quantity c, determined for each integrated peak, may not necessarily be concentration. It can also be mole fraction, as it is determined from the peak area in A and any provided calibration specification. The calibration interface allows for units to be supplied.
Note
The mol fractions in xout always sum up to unity. If there is more than one outlet stream, these mol fractions have to be weighted by the flow rate in a post-processing routine.
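The normalisation and the flow-rate weighting mentioned in the last note can be sketched as follows. Both helpers are hypothetical post-processing code, not yadg functions, and the sketch assumes that the flow rates are molar flow rates given in the same unit for every stream.

```python
def normalise(conc: dict) -> dict:
    """Scale the per-species values so that they sum to unity."""
    total = sum(conc.values())
    return {name: value / total for name, value in conc.items()}


def flow_weighted(streams: list) -> dict:
    """Combine (molar_flow, xout) pairs into overall mol fractions.

    A species missing from one stream is treated as absent (fraction 0).
    """
    total_flow = sum(flow for flow, _ in streams)
    species = set().union(*(xout.keys() for _, xout in streams))
    return {
        name: sum(flow * xout.get(name, 0.0) for flow, xout in streams) / total_flow
        for name in species
    }


xout = normalise({"CO2": 1.0, "N2": 3.0})
overall = flow_weighted([(1.0, {"CO2": 0.5, "N2": 0.5}), (3.0, {"N2": 1.0})])
```

In this example the second stream carries three times the flow of the first, so the overall CO2 fraction is a quarter of its value in the first stream.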
Submodules
agilentch: Processing Agilent OpenLab binary signal trace files (CH and IT).
Currently supports version “179” of the files. Version information is defined in the magic_values (parameters & metadata) and data_dtypes (data) dictionaries.
Adapted from ImportAgilent.m and aston.
Exposed metadata:
params:
  method:   !!str
  sampleid: !!str
  username: !!str
  version:  !!str
  datafile: None
File Structure of .ch files
0x0000 "version magic"
0x0108 "data offset"
0x011a "x-axis minimum (ms)"
0x011e "x-axis maximum (ms)"
0x035a "sample ID"
0x0559 "description"
0x0758 "username"
0x0957 "timestamp"
0x09e5 "instrument name"
0x09bc "inlet"
0x0a0e "method"
0x104c "y-axis unit"
0x1075 "detector name"
0x1274 "y-axis intercept"
0x127c "y-axis slope"
Data is stored in a consecutive set of <f8, starting at the offset (calculated as offset = ("data offset" - 1) * 512) until the end of the file.
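The offset calculation above can be sketched with NumPy as shown below. Both function names are hypothetical. The sketch takes the already-decoded "data offset" value as an argument, since the encoding of the header fields is not described here, and it assumes that the time axis is evenly spaced between the x-axis minimum and maximum.

```python
import numpy as np


def read_ch_signal(raw: bytes, data_offset: int) -> np.ndarray:
    """Return the little-endian float64 values stored after the header.

    data_offset is the decoded value of the "data offset" header field.
    """
    start = (data_offset - 1) * 512
    return np.frombuffer(raw, dtype="<f8", offset=start)


def time_axis_seconds(xmin_ms: float, xmax_ms: float, npoints: int) -> np.ndarray:
    """Evenly spaced time axis in seconds (even spacing is an assumption)."""
    return np.linspace(xmin_ms, xmax_ms, num=npoints) / 1000.0
```

With a data offset of 3, for example, the signal starts at byte 1024 of the file.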
Code author: Peter Kraus <peter.kraus@empa.ch>
- yadg.parsers.chromtrace.agilentch.process(fn, encoding, timezone)
Agilent OpenLAB signal trace parser
One chromatogram per file with a single trace. Binary data format.
- Parameters
fn (str) – Filename to process.
encoding (str) – Not used as the file is binary.
timezone (str) – Timezone information. This should be "localtime".
- Returns
([chrom], metadata) – Standard timesteps & metadata tuple.
- Return type
tuple[list, dict]
agilentcsv: Processing Agilent Chemstation Chromtab tabulated data files (csv).
This file format may include more than one timestep in each CSV file. It contains a header section for each timestep, followed by a detector name, and a sequence of [X, Y] datapoints.
Exposed metadata:
params:
  method:   None
  sampleid: !!str
  username: None
  version:  None
  datafile: !!str
Unfortunately, neither method nor version is exposed, which is a big weakness of this file format.
Code author: Peter Kraus <peter.kraus@empa.ch>
- yadg.parsers.chromtrace.agilentcsv.process(fn, encoding, timezone)
Agilent Chemstation CSV (Chromtab) file parser
Each file may contain multiple chromatograms, each with multiple traces. Each chromatogram starts with a header section, which is followed by the traces; each trace consists of a header line and x,y-data.
- Parameters
fn (str) – Filename to process.
encoding (str) – Encoding used to open the file.
timezone (str) – Timezone information. This should be "localtime".
- Returns
(chroms, metadata) – Standard timesteps & metadata tuple.
- Return type
tuple[list, dict]
agilentdx: Processing Agilent OpenLab data archive files (DX).
This is a wrapper parser which unzips the provided DX file, and then uses the yadg.parsers.chromtrace.agilentch parser to parse every CH file present in the archive. The IT files in the archive are currently ignored.
Exposed metadata:
params:
  method:   !!str
  sampleid: !!str
  username: !!str
  version:  !!str
  datafile: !!str
In addition to the metadata exposed by the CH parser, the datafile entry is populated with the corresponding name of the CH file. The fn entry in each timestep contains the parent DX file.
Note
Currently the timesteps from multiple CH files (if present) are appended in the timesteps array without any further sorting.
Code author: Peter Kraus
- yadg.parsers.chromtrace.agilentdx.process(fn, encoding, timezone)
Agilent OpenLab DX archive parser.
This is a simple wrapper around the Agilent OpenLab signal trace parser in yadg.parsers.chromtrace.agilentch. This wrapper first unzips the DX file into a temporary directory, and then processes all CH files found within the archive, concatenating timesteps from multiple files.
- Parameters
fn (str) – Filename to process.
encoding (str) – Not used as the file is binary.
timezone (str) – Timezone information. This should be "localtime".
- Returns
(chroms, metadata) – Standard timesteps & metadata tuple.
- Return type
tuple[list, dict]
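The unzip-and-delegate pattern described for the DX wrapper can be sketched with the standard library alone. The function process_archive and its parse_ch argument are hypothetical stand-ins for the wrapper and for the CH parser, and the files are visited in sorted path order only to make the sketch deterministic.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Callable


def process_archive(fn: str, parse_ch: Callable[[str], list]) -> list:
    """Unzip fn and concatenate the timesteps of every CH file inside."""
    timesteps = []
    with tempfile.TemporaryDirectory() as tmpdir:
        with zipfile.ZipFile(fn) as archive:
            archive.extractall(tmpdir)
        for path in sorted(Path(tmpdir).rglob("*")):
            # only CH files are parsed; IT and other files are skipped
            if path.is_file() and path.suffix.lower() == ".ch":
                for step in parse_ch(str(path)):
                    # report the parent archive, not the temporary file
                    timesteps.append({**step, "fn": fn})
    return timesteps
```

Because the temporary directory is removed when the with block exits, parse_ch has to read everything it needs from the file before returning.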
ezchromasc: Processing EZ-Chrom ASCII export files (dat.asc).
This file format includes one timestep with multiple traces in each ASCII file. It contains a header section, and a sequence of Y datapoints for each detector. The X axis is uniform between traces, and its units have to be deduced from the header.
Exposed metadata:
params:
  method:   !!str
  sampleid: !!str
  username: !!str
  version:  !!str
  datafile: !!str
Code author: Peter Kraus <peter.kraus@empa.ch>
- yadg.parsers.chromtrace.ezchromasc.process(fn, encoding, timezone)
EZ-Chrom ASCII export file parser.
One chromatogram per file with multiple traces. A header section is followed by y-values for each trace. x-values have to be deduced using number of points, frequency, and x-multiplier. Method name is available, but detector names are not. They are assigned their numerical index in the file.
- Parameters
fn (str) – Filename to process.
encoding (str) – Encoding used to open the file.
timezone (str) – Timezone information. This should be "localtime".
- Returns
([chrom], metadata) – Standard timesteps & metadata tuple.
- Return type
tuple[list, dict]
fusionjson: Processing Inficon Fusion json data format (json).
This is a fairly detailed data format, including the traces, the calibration applied, and also the integrated peak areas. If the peak areas are present, they are returned in the list of timesteps as a "peaks" entry.
Note
The detectors in the trace data are not necessarily in a consistent order, which may change between different files. Hence, the keys are sorted.
Exposed metadata:
params:
  method:   !!str
  sampleid: !!str
  username: None
  version:  !!str
  datafile: !!str
Code author: Peter Kraus
- yadg.parsers.chromtrace.fusionjson.process(fn, encoding, timezone)
Fusion json format.
One chromatogram per file with multiple traces, and integrated peak areas.
Warning
To parse the integrated data present in these files, use the chromdata parser.
Only a subset of the metadata is retained, including the method name, detector names, and information about assigned peaks.
- Parameters
fn (str) – Filename to process.
encoding (str) – Encoding used to open the file.
timezone (str) – Timezone information. This should be "localtime".
- Returns
([chrom], metadata) – Standard timesteps & metadata tuple.
- Return type
tuple[list, dict]
fusionzip: Processing Inficon Fusion zipped data format (zip).
This is a wrapper parser which unzips the provided zip file, and then uses the yadg.parsers.chromtrace.fusionjson parser to parse every data file present in the archive.
Exposed metadata:
params:
  method:   !!str
  sampleid: !!str
  username: None
  version:  !!str
  datafile: !!str
Code author: Peter Kraus
- yadg.parsers.chromtrace.fusionzip.process(fn, encoding, timezone)
Fusion zip file format.
Fusion GCs can export their json files as a zip archive of a folder. This parser allows this zip archive to be parsed directly, without the user having to unzip and move the data.
- Parameters
fn (str) – Filename to process.
encoding (str) – Not used as the file is binary.
timezone (str) – Timezone information. This should be "localtime".
- Returns
(chroms, metadata) – Standard timesteps & metadata tuple.
- Return type
tuple[list, dict]
integration: Routines for chromatogram integration.
This module contains the integrate_trace() function, as well as several helper functions to smooth, peak-pick, determine edges, and integrate the supplied traces.
Smoothing
Smoothing can optionally be performed on the Y-values of each trace, using a Savitzky-Golay filter. The default smoothing is performed using a cubic fit to a window length of 7; if the polyorder or the window length is not specified, smoothing is not used.
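A Savitzky-Golay smoothing step of this kind can be sketched with scipy.signal.savgol_filter. The helper name smooth is hypothetical, and whether the module calls this particular SciPy function is an assumption. Smoothing is skipped when either setting is missing, as described above.

```python
import numpy as np
from scipy.signal import savgol_filter


def smooth(ys, window_length=None, polyorder=None):
    """Return smoothed ys, or ys unchanged when either setting is missing."""
    ys = np.asarray(ys, dtype=float)
    if window_length is None or polyorder is None:
        return ys
    return savgol_filter(ys, window_length=window_length, polyorder=polyorder)


x = np.linspace(-1.0, 1.0, 21)
noisy = x**3 - 2 * x + 0.01 * np.sin(40 * x)
smoothed = smooth(noisy, window_length=7, polyorder=3)  # cubic fit, window of 7
```

The window length has to be odd and larger than the polynomial order for the filter to be well defined.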
Peak-picking and edge-finding
Peak-picking is performed on the smoothed Y-data to find peaks, as well as on the mirror image of the data to find bands. Only peaks are further processed. Additionally, the 1st and 2nd derivatives of the Y-data are evaluated, and the zero-points are found using numpy routines.
The peak edges are taken as either the nearest minima adjacent to the peak maximum, or as the inflection points at which the gradient falls below a prescribed threshold, whichever is closer to the peak maximum.
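The minima-based part of this procedure can be sketched with scipy.signal.find_peaks, applied once to the data and once to its mirror image. This is a simplified illustration under stated assumptions: pick_peaks is a hypothetical helper, the gradient-threshold criterion is left out, and a peak without a minimum on one side is extended to the end of the trace.

```python
import numpy as np
from scipy.signal import find_peaks


def pick_peaks(ys, prominence=1.0):
    """Return index-based peak maxima with the nearest minima as limits."""
    ys = np.asarray(ys, dtype=float)
    maxima, _ = find_peaks(ys, prominence=prominence)
    minima, _ = find_peaks(-ys)  # minima are maxima of the mirror image
    peaks = []
    for pmax in maxima:
        left = minima[minima < pmax]
        right = minima[minima > pmax]
        llim = int(left[-1]) if left.size else 0
        rlim = int(right[0]) if right.size else len(ys) - 1
        peaks.append({"max": int(pmax), "llim": llim, "rlim": rlim})
    return peaks


peaks = pick_peaks([0.0, 1.0, 5.0, 1.0, 0.5, 2.0, 6.0, 2.0, 0.0])
```

The two peaks found in this example share the minimum at index 4 as a common limit, which corresponds to the case of adjacent peaks without a gap.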
Baseline correction
Using the determined peak-edges, the baseline is linearly interpolated in sections of Y-data which belong to a peak. The interpolation is performed using the raw (not smoothened) Y-data.
If multiple peaks are adjacent to each other without a gap, the interpolation begins at the left limit of the leftmost peak and continues uninterrupted to the right limit of the rightmost peak. The points which belong to the interpolated areas are assumed to have an uncertainty of zero.
The corrected baseline is then obtained by subtracting the interpolated baseline from the original raw (not smoothened) data.
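For a single peak, the interpolation and subtraction can be sketched with np.interp. The helper corrected_segment is hypothetical; it handles only the section between one pair of limits and ignores uncertainties.

```python
import numpy as np


def corrected_segment(xs, ys, llim, rlim):
    """Subtract a straight baseline drawn between the points at llim and rlim.

    Returns the baseline-corrected raw values for indices llim..rlim inclusive.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    seg = slice(llim, rlim + 1)
    baseline = np.interp(xs[seg], [xs[llim], xs[rlim]], [ys[llim], ys[rlim]])
    return ys[seg] - baseline


corrected = corrected_segment(
    [0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 5.0, 4.0, 3.0], llim=0, rlim=4
)
```

By construction the corrected values are zero at both limits, so a sloping baseline does not contribute to the integrated area.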
Peak integration
Peak integration is performed on the corrected baseline and the matching X-data using the trapezoidal method as implemented in np.trapz.
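The trapezoidal rule is written out explicitly in the sketch below, since NumPy 2.0 deprecates np.trapz in favour of np.trapezoid. The helper name trapezoid_area is hypothetical, and the result matches what np.trapz(ys, xs) computes.

```python
import numpy as np


def trapezoid_area(xs, ys):
    """Integrate ys over xs with the trapezoidal rule."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # mean height of each pair of neighbouring points times their spacing
    return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))


area = trapezoid_area([0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 0.5, 3.0, 1.5, 0.0])
```

Because the X-data is in seconds, the resulting area carries the unit of the y-axis multiplied by seconds.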
Code author: Peter Kraus <peter.kraus@empa.ch>
- yadg.parsers.chromtrace.integration.integrate_trace(traces, chromspec)
Integration, calibration, and normalisation handling function. Used to process all chromatographic data for which a calibration has been provided.
- Parameters
traces (dict) – A dictionary of trace data, with keys being the “raw” name of the detector, and the values containing the "id" for specification matching, and a "data" tuple containing the (xs, ys), where each element is a pair of (np.ndarray) with the nominal values and standard deviations.
chromspec (dict) – Parsed calibration information, with keys being the detector names in the calibration file, and values containing the "id" for detector matching, a "peakdetect" dictionary with peak-picking and edge-finding settings, and a "species" dictionary with names of species as keys and the left, right limits and calibration information as values.
- Returns
derived – A dictionary containing the derived data, including the peak heights, areas, concentrations, mol fractions, and a dictionary with the peak picking information (name, maximum, limits, height, area).
- Return type
dict[dict]
- yadg.parsers.chromtrace.main.parse_detector_spec(calfile=None, detectors=None, species=None)
Chromatography detector parser.
Combines the specification provided in calfile with that provided in detectors and species.
The format of calfile is as follows:
"{{ detector_name }}":        # name of the detector
  id:            !!int        # ID of the detector used for matching
  prefer:        !!bool       # whether to prefer this detector for xout calc
  peakdetect:
    window:      !!int        # Savitzky-Golay window_length = 2*window + 1
    polyorder:   !!int        # Savitzky-Golay polyorder
    prominence:  !!float      # peak picking prominence parameter
    threshold:   !!float      # peak edge detection threshold
  species:
    "{{ species_name }}":     # name of the analyte
      l:         !!float      # peak picking left limit [s]
      r:         !!float      # peak picking right limit [s]
      calib:     {}           # calibration specification
      unit:      !!str        # optional unit for the concentration, by default %
Note
The syntax of the calibration specification is detailed in yadg.dgutils.calib.calib_handler().
The format of detectors is as follows:
"{{ detector_name }}":        # name of the detector
  id:            !!int        # ID of the detector used for matching
  prefer:        !!bool       # whether to prefer this detector for xout calc
  peakdetect:
    window:      !!int        # Savitzky-Golay window_length = 2*window + 1
    polyorder:   !!int        # Savitzky-Golay polyorder
    prominence:  !!float      # peak picking prominence parameter
    threshold:   !!float      # peak edge detection threshold
The format of species is as follows:
"{{ detector_name }}":        # name of the detector
  species:
    "{{ species_name }}":     # name of the analyte
      l:         !!float      # peak picking left limit [s]
      r:         !!float      # peak picking right limit [s]
      calib:     !!calib      # calibration specification
Note
The syntax of the calibration specification is detailed in yadg.dgutils.calib.calib_handler().
- Parameters
calfile (Optional[str]) – A json file containing the calibration data in the format prescribed above.
detectors (Optional[dict]) – A dictionary containing the "id", "peakdetect" and "prefer" keys for each detector, as shown above.
species (Optional[dict]) – A dictionary containing the species names as keys and their specification as dictionaries, as shown above.
- Returns
calib – The combined calibration specification.
- Return type
dict
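The dictionary formats above can be illustrated with a small standalone sketch. The detector name "FID", the species "CO2" and all numbers are made up, the calib entry is left empty because its syntax is documented elsewhere, and merge_spec is a hypothetical helper showing one plausible way of combining the two dictionaries rather than the logic used by parse_detector_spec().

```python
# Illustrative inputs following the documented formats; all values are made up.
detectors = {
    "FID": {
        "id": 0,
        "prefer": True,
        "peakdetect": {
            "window": 3, "polyorder": 3, "prominence": 1.0, "threshold": 1.0,
        },
    }
}
species = {
    "FID": {"species": {"CO2": {"l": 60.0, "r": 75.0, "calib": {}}}}
}


def merge_spec(detectors: dict, species: dict) -> dict:
    """Attach the species of each detector to its detector settings."""
    merged = {}
    for name in sorted(set(detectors) | set(species)):
        entry = dict(detectors.get(name, {}))
        entry["species"] = dict(species.get(name, {}).get("species", {}))
        merged[name] = entry
    return merged


def savgol_window_length(window: int) -> int:
    """Window length as given in the comments above: 2*window + 1."""
    return 2 * window + 1


calib = merge_spec(detectors, species)
```

With window set to 3, savgol_window_length returns 7, which is the window length quoted for the default smoothing.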
- yadg.parsers.chromtrace.main.process(fn, encoding='utf-8', timezone='localtime', parameters=None)
Unified raw chromatogram parser.
This parser processes GC and LC chromatograms in signal(time) format. When provided with a calibration file, this tool will integrate the trace, and provide the peak areas, retention times, and concentrations of the detected species.
- Parameters
fn (str) – The file containing the trace(s) to parse.
encoding (str) – Encoding of fn, by default “utf-8”.
timezone (str) – A string description of the timezone. Default is “localtime”.
parameters (Optional[BaseModel]) – Parameters for ChromTrace.
- Returns
(data, metadata, fulldate) – Tuple containing the timesteps, metadata, and full date tag. All currently supported file formats return full date.
- Return type
tuple[list, dict, bool]