Key features of yadg

Units and uncertainties

One of the key features of yadg is the enforced association of units and uncertainties with measured properties. This means that all experimental quantities are accompanied by an uncertainty estimate, derived either from the string -> float conversion of the data, or from instrumental resolution, if known.

Units

In the resulting NetCDF files, the unit annotations are stored in .attrs["units"] on each xarray.DataArray, that is, within each “column” of each “node” of the DataTree. If an entry does not contain .attrs["units"], the quantity is dimensionless.
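
As a sketch of how these annotations can be inspected (the file, node, and column names below are illustrative, and a recent version of xarray with DataTree support is assumed):

    import xarray as xr

    # Open a NetCDF file produced by yadg; "node" and "pressure" are
    # hypothetical names for a DataTree node and one of its columns.
    dt = xr.open_datatree("data.nc")
    units = dt["node"]["pressure"].attrs.get("units")
    print(units if units is not None else "dimensionless")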

Warning

A special pint.UnitRegistry was exposed in yadg-4.x under yadg.dgutils.ureg. Use of this registry is deprecated as of yadg-5.0, and it will be removed in yadg-6.0.

Uncertainties

In many cases, more than one uncertainty can be defined for each measurement: for example, its accuracy, its precision, and the instrument resolution may all be available. The convention in yadg is that when both a measure of within-measurement uncertainty (resolution) and of cross-measurement error (accuracy) are available, the stored uncertainty corresponds to the instrumental resolution associated with each datapoint. The precision of the measurement (normally larger than the resolution) can be obtained using post-processing, e.g. as a mean() and stdev() of a series of data.
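
For instance, the precision could be estimated in post-processing from the scatter of a repeated series of datapoints (illustrative values only):

    import numpy as np

    # Repeated measurements of a nominally constant quantity:
    series = np.array([99.8, 100.2, 100.1, 99.9, 100.0])
    best_value = series.mean()       # best estimate of the value
    precision = series.std(ddof=1)   # cross-measurement standard deviation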

Unless more information is available, when converting str data to float, the uncertainty is determined from the last significant digit specified in the str. For this, functionality from the uncertainties package is used.
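
This behaviour can be illustrated directly using the uncertainties package, which implies an uncertainty of one unit of the last significant digit when none is given:

    from uncertainties import ufloat_fromstr

    # "0.0123" carries an implied uncertainty of 0.0001:
    val = ufloat_fromstr("0.0123")
    print(val.nominal_value, val.std_dev)  # 0.0123 0.0001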

In the resulting NetCDF files, the uncertainties for each f"{entry}" are stored as a separate data variable, f"{entry}_std_err". The link between the nominal value and its uncertainty is annotated using .attrs["ancillary_variables"] = f"{entry}_std_err". The reverse link between the uncertainty and its nominal value is annotated similarly, using .attrs["standard_name"] = f"{entry} standard_error". This follows the NetCDF CF Metadata Conventions, see Section 3.4 on Ancillary Data.
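
In practice, the uncertainty of a column can therefore be located via its annotations; a sketch, with hypothetical node and entry names:

    import xarray as xr

    dt = xr.open_datatree("data.nc")
    node = dt["node"]
    err_name = node["pressure"].attrs["ancillary_variables"]  # "pressure_std_err"
    std_err = node[err_name]
    print(std_err.attrs["standard_name"])  # "pressure standard_error"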

Timestamping

Another key feature of yadg is the timestamping of all datapoints. The Unix timestamp is used, as it is the natural timestamp for Python, and with its resolution in seconds it can be easily converted to minutes or hours. All conversions of date and time objects into Unix timestamps are timezone-aware, with the local timezone used as the default.
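
A minimal sketch of such a timezone-aware conversion, using only the standard library (the timestamp string and timezone are illustrative):

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # Parse a naive date-and-time string, attach a timezone, and
    # convert to a Unix timestamp in seconds:
    naive = datetime.strptime("2024-01-15 10:30:00", "%Y-%m-%d %H:%M:%S")
    aware = naive.replace(tzinfo=ZoneInfo("Europe/Zurich"))
    uts = aware.timestamp()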

Most of the supported file formats contain a timestamp of some kind. However, several file formats do not include a complete timestamp, omitting the acquisition date, the time, or both. That is why yadg includes a powerful “external date” interface, see complete_timestamps(), which allows you to supply timestamp information externally.

Locale support

Support for parsing decimal numbers in localized files is implemented in yadg via the babel library, allowing you to specify the locale of the file using standard locale strings, such as en_US or de_CH. This avoids “hacks” such as replacing decimal separators (, vs .) and thousands separators when processing localizable files. By default, yadg attempts to infer the locale from the LC_NUMERIC environment variable; if this is not set in your environment, en_GB is used as a fallback.
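
For illustration, this is essentially the functionality babel provides (de_DE is used here as an example locale):

    from babel.numbers import parse_decimal

    # The same number, written under two different locales:
    parse_decimal("1,234.56", locale="en_US")  # Decimal('1234.56')
    parse_decimal("1.234,56", locale="de_DE")  # Decimal('1234.56')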

Note that locale settings currently do not affect processing of date and time strings.

Original metadata

By default, yadg attempts to decode and store all metadata it understands from the extracted files. Currently, this metadata is stored in the original_metadata entry within the .attrs on the DataTree nodes, and is serialised into JSON strings by the yadg.extractors.extract() function.
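
A sketch of how this metadata could be retrieved after extraction (the filetype and path are placeholders, and the metadata is assumed to sit on the root node):

    import json
    from yadg.extractors import extract

    dt = extract(filetype="example.filetype", path="measurement.dat")
    meta = json.loads(dt.attrs["original_metadata"])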

Warning

The original_metadata functionality was introduced in yadg-5.1, and its implementation may change in future versions.

Note

When merging multiple files into one DataTree, the original_metadata entries of the processed files may not be identical. In such cases, running yadg with the --ignore-merge-errors option will drop the conflicting metadata entries and proceed with processing.

DataSchema validation

Additionally, yadg provides DataSchema validation and updating functionality, using the schema models from the dgbowl_schemas.yadg.dataschema package. The schemas are implemented in Pydantic and are developed in lockstep with yadg. These Pydantic-based validator classes should be used to ensure that an incoming dataschema is valid.
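
A minimal sketch of such validation, assuming the to_dataschema() entry point of the dgbowl_schemas package and a schema stored in a JSON file:

    import json
    from dgbowl_schemas import to_dataschema

    # Raises a pydantic.ValidationError if the contents do not match
    # any known DataSchema version:
    with open("dataschema.json") as f:
        ds = to_dataschema(**json.load(f))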