Key features of yadg
Units and uncertainties
One of the key features of yadg is the enforced association of units and uncertainties with measured properties. This means that all experimental quantities are accompanied by an uncertainty estimate, derived either from the (string -> float) representation of the data, or from the instrumental resolution, if known.
Units
In the resulting NetCDF files, the unit annotations are stored in .attrs["units"] on each xarray.DataArray, that is, within each “column” of each “node” of the datatree.DataTree. If an entry does not contain .attrs["units"], the quantity is dimensionless.
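As a rough sketch of how this convention might be consumed downstream, the hypothetical helper below (unit_of, not part of yadg) reads the unit annotation from any object that exposes an .attrs mapping, as xarray.DataArray does. The stand-in objects and variable names are assumptions, chosen so that the example runs without any data files.

```python
from types import SimpleNamespace


def unit_of(data_array):
    """Return the unit string, or None when the quantity is dimensionless."""
    # A missing "units" attribute means dimensionless under the convention above.
    return data_array.attrs.get("units")


# Stand-ins for xarray.DataArray objects; only the .attrs mapping is used.
voltage = SimpleNamespace(attrs={"units": "V"})
cycle_number = SimpleNamespace(attrs={})

for name, column in {"voltage": voltage, "cycle_number": cycle_number}.items():
    print(name, unit_of(column) or "dimensionless")
```

Because the helper only touches .attrs, it should work unchanged on a real xarray.DataArray loaded from a NetCDF file.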
Warning
A special pint.UnitRegistry was exposed in yadg-4.x under yadg.dgutils.ureg. Use of this pint.UnitRegistry is deprecated as of yadg-5.0, and it will be removed in yadg-6.0.
Uncertainties
In many cases it is possible to define more than one uncertainty for each measurement: for example, accuracy, precision, and instrument resolution may all be available. The convention in yadg is that when both a within-measurement uncertainty (resolution) and a cross-measurement error (accuracy) are available, the stored uncertainty is the instrumental resolution associated with each datapoint. The precision of the measurement (which is normally larger than the resolution) can be obtained in post-processing, e.g. from the mean() and stdev() of a series of data.
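The post-processing step mentioned above can be sketched with the Python standard library. The readings and the resolution below are made-up values for illustration only; they stand for repeated measurements of one nominally constant quantity.

```python
from statistics import mean, stdev

# Hypothetical repeated readings of the same quantity, each stored with
# an instrumental resolution of 0.01 as its uncertainty.
readings = [1.02, 0.98, 1.01, 0.99, 1.00]
resolution = 0.01

nominal = mean(readings)     # best estimate of the quantity
precision = stdev(readings)  # sample standard deviation across readings

print(f"nominal value: {nominal:.3f}")
print(f"precision:     {precision:.3f}")
print(f"resolution:    {resolution:.3f}")
```

In this made-up series the scatter between readings exceeds the resolution of a single reading, which is the typical situation described above.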
Unless more information is available, when converting str data to float, the uncertainty is determined from the last significant digit specified in the str. The uncertainties package is used for this.
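To illustrate the idea of deriving an uncertainty from the written representation, here is a minimal sketch that uses the standard library decimal module rather than the uncertainties package. The helper name value_and_uncertainty is hypothetical, and taking the uncertainty as one unit in the last written digit is an assumption of this sketch, not a statement about the exact rule yadg applies.

```python
from decimal import Decimal


def value_and_uncertainty(text):
    """Parse a numeric string; return (value, one unit in the last written digit)."""
    number = Decimal(text.strip())
    # The exponent of the parsed decimal marks the position of the last digit
    # as written, e.g. -2 for "1.23" and 0 for "120".
    # Note: special values such as "NaN" or "Infinity" are not handled here.
    exponent = number.as_tuple().exponent
    return float(number), float(Decimal(1).scaleb(exponent))


for text in ["1.23", "1.230", "4.5e-3", "120"]:
    value, uncertainty = value_and_uncertainty(text)
    print(text, value, uncertainty)
```

Note that "1.23" and "1.230" parse to the same float but yield different uncertainties, since the trailing zero is a written digit.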
In the resulting NetCDF files, the uncertainties for each f"{entry}" are stored as a separate data variable, f"{entry}_std_err". The link between the nominal value and its uncertainty is annotated using .attrs["ancillary_variables"] = f"{entry}_std_err". The reverse link between the uncertainty and its nominal value is annotated similarly, using .attrs["standard_name"] = f"{entry} standard_error". This follows the NetCDF CF Metadata Conventions, see Section 3.4 on Ancillary Data.
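The two-way annotation described above can be sketched on plain dictionaries. The helpers link_uncertainty and uncertainty_of are hypothetical and not part of yadg, and the mapping of variable names to attribute dictionaries is an assumed stand-in for the .attrs of each data variable in a real dataset.

```python
def link_uncertainty(attrs_by_name, entry):
    """Annotate a value/uncertainty pair using the naming described above."""
    err = f"{entry}_std_err"
    # Forward link: nominal value -> its uncertainty variable.
    attrs_by_name[entry]["ancillary_variables"] = err
    # Reverse link: uncertainty variable -> its nominal value.
    attrs_by_name[err]["standard_name"] = f"{entry} standard_error"


def uncertainty_of(attrs_by_name, entry):
    """Return the name of the variable holding the uncertainty, or None."""
    return attrs_by_name[entry].get("ancillary_variables")


# Hypothetical dataset: variable name -> attribute dictionary.
attrs_by_name = {"T": {"units": "K"}, "T_std_err": {"units": "K"}}
link_uncertainty(attrs_by_name, "T")

print(uncertainty_of(attrs_by_name, "T"))
print(attrs_by_name["T_std_err"]["standard_name"])
```

When reading a file, following ancillary_variables from the nominal value is usually more robust than reconstructing the name of the uncertainty variable by hand.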
Timestamping
Another key feature of yadg is the timestamping of all datapoints. The Unix timestamp is used, as it is the natural timestamp for Python, and since it is expressed in seconds, it can be easily converted to minutes or hours.
Most of the supported file formats contain a timestamp of some kind. However, several file formats may not define both the date and the time of each datapoint, or may define neither. For such cases, yadg includes an “external date” interface, see complete_timestamps().
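The sketch below illustrates the general idea of completing a time-of-day with an externally supplied date to obtain a Unix timestamp. It does not use or reproduce complete_timestamps(). The helper to_uts is hypothetical, and treating the times as UTC is an assumption of the example.

```python
from datetime import date, datetime, time, timezone


def to_uts(day, time_of_day, tz=timezone.utc):
    """Combine an external date with a time of day into a Unix timestamp."""
    return datetime.combine(day, time_of_day, tzinfo=tz).timestamp()


# Hypothetical case: the file stores only times of day, and the date is
# supplied externally (e.g. by the user).
external_date = date(2021, 1, 1)
times_of_day = [time(12, 0, 0), time(12, 30, 0), time(14, 0, 0)]

uts = [to_uts(external_date, t) for t in times_of_day]

# Unix timestamps are in seconds, so elapsed time converts by simple division.
elapsed_minutes = [(t - uts[0]) / 60 for t in uts]
elapsed_hours = [(t - uts[0]) / 3600 for t in uts]
print(elapsed_minutes)
print(elapsed_hours)
```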
Dataschema validation
Additionally, yadg provides dataschema validation functionality, using the schema models from the dgbowl_schemas.yadg.dataschema package, implemented in Pydantic. The schemas are developed in lockstep with yadg. This Pydantic-based validator class should be used to ensure that the incoming dataschema is valid.