Key features of yadg

Units and uncertainties

One of the key features of yadg is the enforced association of units and uncertainties with measured properties. This means that all experimental quantities are accompanied by an uncertainty estimate, derived either from the (string -> float) representation of the data, or from instrumental resolution, if known.

Units

In the resulting NetCDF files, the unit annotations are stored in .attrs["units"] on each xarray.DataArray, that is within each “column” of each “node” of the datatree.DataTree. If an entry does not contain .attrs["units"], the quantity is dimensionless.

Warning

A special pint.UnitRegistry was exposed in yadg-4.x under yadg.dgutils.ureg. Use of this pint.UnitRegistry is deprecated as of yadg-5.0, and it will be removed in yadg-6.0.

Uncertainties

In many cases it is possible to define more than one uncertainty for each measurement: for example, accuracy, precision, as well as instrument resolution etc. may be available. The convention in yadg is that when both a measure of within-measurement uncertainty (resolution) and a cross-measurement error (accuracy) are available, the stored uncertainty corresponds to the instrumental resolution associated with each datapoint, i.e. the resolution. The precision of the measurement (which is normally a higher value than that of the resolution) can be obtained using post-processing, e.g. as a mean() and stdev() of a series of data.

Unless more information is available, when converting str data to float, the uncertainty is determined from the last significant digit specified in the str. For this, the functionality from within the uncertainties package is used.

In the resulting NetCDF files, the uncertainties for each f"{entry}" are stored as a separate data variable, f"{entry}_std_err". The link between the nominal value and its uncertainty is annotated using .attrs["ancillary_variables"] = f"{entry}_std_err". The reverse link between the uncertainty and its nominal value is annotated similarly, using .attrs["standard_name"] = f"{entry} standard_error". This follows the NetCDF CF Metadata Conventions, see Section 3.4 on Ancillary Data.

Timestamping

Another key feature in yadg is the timestamping of all datapoints. The Unix timestamp is used, as it’s the natural timestamp for Python, and with its resolution in seconds it can be easily converted to minutes or hours.

Most of the supported file formats contain a timestamp of some kind. However, several file formats may not define both date and time of each datapoint, or may define neither. That is why yadg includes a powerful “external date” interface, see complete_timestamps().

Dataschema validation

Additionally, yadg provides dataschema validation functionality, by using the schema models from the dgbowl_schemas.yadg.dataschema package, implemented in Pydantic. The schemas are developed in lockstep with yadg. This Pydantic-based validator class should be used to ensure that the incoming dataschema is valid.