basiccsv: Common tabular file parser

This parser reads and processes any tabular file whose first line contains the column headers. By default, the second line should contain the units. The columns of the table must be separated using a separator such as a comma (,), a semicolon (;), or a tab (\t).
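The layout described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the yadg implementation; the function name read_header_and_units and the sample table are hypothetical and serve only to show the expected file structure (headers on line one, units on line two, data afterwards).

```python
import csv
import io


def read_header_and_units(text, sep=","):
    """Split a table into headers (line 1), units (line 2) and data rows.

    Hypothetical helper for illustration only; not part of yadg.
    """
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    headers = [h.strip() for h in next(reader)]
    unit_row = [u.strip() for u in next(reader)]
    units = dict(zip(headers, unit_row))
    rows = [row for row in reader if row]  # skip empty lines
    return headers, units, rows


# A made-up, semicolon-separated table with a units line.
sample = "time;T;p\ns;K;bar\n0;298.15;1.0\n60;299.0;1.1\n"
headers, units, rows = read_header_and_units(sample, sep=";")
```

Here headers holds the column names, units maps each column name to its unit string, and rows holds the remaining lines as lists of strings.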

Warning

Since yadg-4.2, the parser handles sparse tables (i.e. tables with missing data) by creating sparse datagrams, which means that each element of the header might not be present in every timestep.
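The practical consequence for downstream code is that a column cannot be assumed to exist in every timestep. The following is a conceptual sketch of that behaviour, not the yadg implementation; sparse_row is a hypothetical helper and the values are invented.

```python
def sparse_row(headers, items):
    """Keep only the columns that have a non-empty value in this row.

    Hypothetical helper illustrating the idea of a sparse timestep.
    """
    return {h: float(v) for h, v in zip(headers, items) if v.strip() != ""}


headers = ["T", "p", "flow"]
table = [
    ["298.15", "1.0", "5.0"],  # complete row
    ["299.0", "", "5.1"],      # "p" is missing in this row
]
timesteps = [sparse_row(headers, row) for row in table]

# Consumers should look keys up defensively, e.g. with dict.get().
pressures = [step.get("p") for step in timesteps]
```

In this sketch the second timestep simply has no "p" entry, so step.get("p") yields None for it instead of raising a KeyError.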

Note

basiccsv attempts to deduce the timestamp from the column headers, using yadg.dgutils.dateutils.infer_timestamp_from(). Alternatively, the column(s) containing the timestamp data and their format can be provided using parameters.
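As an illustration of what an explicit timestamp specification has to achieve, the sketch below combines a separate date column and time column into a Unix timestamp. It is not the code behind infer_timestamp_from(); the helper uts_from_date_time is hypothetical, the format strings are assumed to be strptime-style, and the values are assumed to be in UTC for simplicity.

```python
from datetime import datetime, timezone


def uts_from_date_time(row, date_spec, time_spec):
    """Build a Unix timestamp from a date column and a time column.

    Hypothetical helper; each spec is a dict with an "index" (column
    position) and a "format" (assumed to be a strptime format string).
    """
    text = f"{row[date_spec['index']]} {row[time_spec['index']]}"
    fmt = f"{date_spec['format']} {time_spec['format']}"
    parsed = datetime.strptime(text, fmt)
    # Assumption: the file stores UTC; a real parser would apply a timezone.
    return parsed.replace(tzinfo=timezone.utc).timestamp()


row = ["2021-03-04", "12:30:00", "298.15"]
uts = uts_from_date_time(
    row,
    date_spec={"index": 0, "format": "%Y-%m-%d"},
    time_spec={"index": 1, "format": "%H:%M:%S"},
)
```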

Usage

Available since yadg-4.0. The parser supports the following parameters:

pydantic model dgbowl_schemas.yadg.dataschema_4_2.step.BasicCSV.Params

JSON schema:

{
   "title": "Params",
   "type": "object",
   "properties": {
      "sep": {
         "title": "Sep",
         "default": ",",
         "type": "string"
      },
      "strip": {
         "title": "Strip",
         "type": "string"
      },
      "units": {
         "title": "Units",
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      },
      "timestamp": {
         "title": "Timestamp",
         "anyOf": [
            {
               "$ref": "#/definitions/Timestamp"
            },
            {
               "$ref": "#/definitions/TimeDate"
            },
            {
               "$ref": "#/definitions/UTS"
            }
         ]
      },
      "sigma": {
         "title": "Sigma",
         "deprecated": true,
         "type": "object",
         "additionalProperties": {
            "$ref": "#/definitions/Tol"
         }
      },
      "calfile": {
         "title": "Calfile",
         "deprecated": true,
         "type": "string"
      },
      "convert": {
         "title": "Convert",
         "deprecated": true
      }
   },
   "additionalProperties": false,
   "definitions": {
      "TimestampSpec": {
         "title": "TimestampSpec",
         "description": "Specification of the column index and string format of the timestamp.",
         "type": "object",
         "properties": {
            "index": {
               "title": "Index",
               "type": "integer"
            },
            "format": {
               "title": "Format",
               "type": "string"
            }
         },
         "additionalProperties": false
      },
      "Timestamp": {
         "title": "Timestamp",
         "description": "Timestamp from a column containing a single timestamp string.",
         "type": "object",
         "properties": {
            "timestamp": {
               "$ref": "#/definitions/TimestampSpec"
            }
         },
         "required": [
            "timestamp"
         ],
         "additionalProperties": false
      },
      "TimeDate": {
         "title": "TimeDate",
         "description": "Timestamp from a separate date and/or time column.",
         "type": "object",
         "properties": {
            "date": {
               "$ref": "#/definitions/TimestampSpec"
            },
            "time": {
               "$ref": "#/definitions/TimestampSpec"
            }
         },
         "additionalProperties": false
      },
      "UTS": {
         "title": "UTS",
         "description": "Timestamp from a column containing a Unix timestamp.",
         "type": "object",
         "properties": {
            "uts": {
               "$ref": "#/definitions/TimestampSpec"
            }
         },
         "required": [
            "uts"
         ],
         "additionalProperties": false
      },
      "Tol": {
         "title": "Tol",
         "description": "Specification of absolute and relative tolerance/error.",
         "type": "object",
         "properties": {
            "atol": {
               "title": "Atol",
               "type": "number"
            },
            "rtol": {
               "title": "Rtol",
               "type": "number"
            }
         },
         "additionalProperties": false
      }
   }
}

field sep: str = ',': Separator of table columns.

field strip: str = None: A str of characters to strip from headers & data.

field units: Optional[Mapping[str, str]] = PydanticUndefined: A dict containing column: unit key-value pairs.

field timestamp: Optional[Union[dgbowl_schemas.yadg.dataschema_4_2.timestamp.Timestamp, dgbowl_schemas.yadg.dataschema_4_2.timestamp.TimeDate, dgbowl_schemas.yadg.dataschema_4_2.timestamp.UTS]] = PydanticUndefined: Timestamp specification allowing calculation of Unix timestamp for each table row.

field sigma: Optional[Mapping[str, dgbowl_schemas.yadg.dataschema_4_2.parameters.Tol]] = None: External uncertainty specification.

DEPRECATED in DataSchema-4.2

This feature is deprecated as of yadg-4.2 and will stop working in yadg-5.0.

field calfile: Optional[str] = None: Column calibration specification.

DEPRECATED in DataSchema-4.2

This feature is deprecated as of yadg-4.2 and will stop working in yadg-5.0.

field convert: Optional[Any] = None: Column renaming specification.

DEPRECATED in DataSchema-4.2

This feature is deprecated as of yadg-4.2 and will stop working in yadg-5.0.

DEPRECATED in yadg-4.2

The sigma, convert and calfile parameters are deprecated as of yadg-4.2 and will stop working in yadg-5.0.
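For orientation, a parameter set that follows the JSON schema above might look like the Python dict below. The column names, units, and the strptime-style format string are invented for the example, and the key check is only a hand-written stand-in for the validation that the pydantic model performs.

```python
# Top-level keys permitted by the schema ("additionalProperties": false).
ALLOWED_KEYS = {"sep", "strip", "units", "timestamp", "sigma", "calfile", "convert"}
# Keys marked as deprecated in the schema.
DEPRECATED_KEYS = {"sigma", "calfile", "convert"}

# Example values only: a semicolon-separated file whose first column
# holds a single timestamp string.
params = {
    "sep": ";",
    "units": {"T": "K", "p": "bar"},
    "timestamp": {
        "timestamp": {"index": 0, "format": "%Y-%m-%d %H:%M:%S"},
    },
}

unknown = set(params) - ALLOWED_KEYS        # keys the schema would reject
deprecated = set(params) & DEPRECATED_KEYS  # keys to migrate away from
```

Note the nesting of the timestamp parameter: the outer key is the parameter name, while the inner key ("timestamp", "date" and/or "time", or "uts") selects one of the three timestamp specifications.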

Provides

The primary functionality of basiccsv is to load the tabular data, and determine the Unix timestamp. The headers of the tabular data are taken verbatim from the file, and appear as raw data keys:

- uts: !!float
  fn:  !!str
  raw:
      "{{ column_name }}":
          {n: !!float, s: !!float, u: !!str}

Submodules

yadg.parsers.basiccsv.main.process_row(headers, items, units, datefunc, datecolumns, calib={})

A function that processes a row of a table.

This is the main worker function of basiccsv, but can be re-used by any other parser that needs to process tabular data.

This function processes the "calib" parameter, which should be a dict in the following format:

- new_name:     !!str    # derived entry name
  - old_name:   !!str    # raw header name
    - calib: {}          # calibration specification
    fraction:   !!float  # coefficient for linear combinations of old_name
  unit:         !!str    # unit of new_name

The syntax of the calibration specification is detailed in yadg.dgutils.calib.calib_handler().

Parameters

headers (list) – A list of headers of the table.
items (list) – A list of values corresponding to the headers. Must be the same length as headers.
units (dict) – A dict for looking up the units corresponding to a certain header.
datefunc (Callable) – A function that will generate uts given a list of values.
datecolumns (list) – Column indices that need to be passed to datefunc to generate uts.
calib (dict) – Specification for converting raw data in headers and items to other quantities. Arbitrary linear combinations of headers are possible. See the above section for the specification.

Returns

element – A result dictionary, containing the keys "uts" with a timestamp, "raw" for all raw data present in the headers, and "derived" for any data processed via calib.

Return type

dict
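The interplay of headers, items, units, datefunc and datecolumns can be sketched as follows. This is a simplified stand-in, not the actual process_row(): it omits calib and the "derived" key, uses a placeholder uncertainty of 0.0, and makes several assumptions that the text above does not confirm (date columns are left out of "raw", empty cells are skipped, and non-numeric values are stored as plain strings).

```python
def process_row_sketch(headers, items, units, datefunc, datecolumns):
    """Simplified illustration of row processing; not yadg's process_row()."""
    if len(headers) != len(items):
        raise ValueError("headers and items must have the same length")
    values = [item.strip() for item in items]
    # datefunc receives the list of values found in the date columns.
    element = {"uts": datefunc([values[i] for i in datecolumns]), "raw": {}}
    for i, (header, value) in enumerate(zip(headers, values)):
        if i in datecolumns or value == "":
            continue  # assumption: skip date columns and empty cells
        try:
            element["raw"][header] = {
                "n": float(value),
                "s": 0.0,  # placeholder; no uncertainty estimate here
                "u": units.get(header, ""),
            }
        except ValueError:
            element["raw"][header] = value  # assumption: keep strings verbatim
    return element


row = process_row_sketch(
    headers=["uts", "T", "comment"],
    items=["1614861000", "298.15", "ok"],
    units={"T": "K"},
    datefunc=lambda cols: float(cols[0]),  # column already holds a Unix timestamp
    datecolumns=[0],
)
```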

yadg.parsers.basiccsv.main.process(fn, encoding='utf-8', timezone='localtime', parameters=None)

A basic csv parser.

This parser processes a csv file. The header of the csv file consists of one or two lines, with the column headers in the first line and the units in the second. The parser also attempts to parse column names to produce a timestamp, and save all other columns as floats or strings.

Parameters

fn (str) – File to process
encoding (str) – Encoding of fn, by default “utf-8”.
timezone (str) – A string description of the timezone. Default is “localtime”.
parameters (Optional[BaseModel]) – Parameters for BasicCSV.

Returns

(data, metadata, fulldate) – Tuple containing the timesteps, metadata, and full date tag. No metadata is returned by the basiccsv parser. The full date might not be returned, e.g. when only the time is specified in the columns.

Return type

tuple[list, dict, bool]