Developer documentation
The project follows fairly standard developer practices. Every new feature should be accompanied by a test, and every PR must pass linting and formatting checks using ruff.
Testing
Tests are located in the tests folder of the repository, and are executed using pytest for every commit in every PR.
If a new test requires additional data (input files, schemas, etc.), these can be placed in a folder named after the test module (that is, test_yadg.py has its test files in the test_yadg folder). Tests for new extractors should be added in separate test modules, following the test_x_{extractor_name}.py naming scheme.
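The naming conventions above can be sketched in a few lines of Python. This is only an illustration: the helper names data_dir_for and extractor_test_module, and the extractor name "example", are hypothetical and are not part of the yadg test suite.

```python
from pathlib import Path


def data_dir_for(test_module: str) -> Path:
    """Return the data folder for a test module: same path, without ``.py``."""
    return Path(test_module).with_suffix("")


def extractor_test_module(extractor_name: str) -> str:
    """Build the test module file name for a (hypothetical) extractor."""
    return f"test_x_{extractor_name}.py"


# tests/test_yadg.py keeps its input files in the tests/test_yadg folder.
print(data_dir_for("tests/test_yadg.py"))

# Tests for an extractor called "example" would live in this module.
print(extractor_test_module("example"))
```

Inside a real test module, the same folder could be derived from the module's own path, so that tests do not depend on the current working directory.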
A convenient testing function, compare_datatrees(), is available in the tests.utils module. This function compares two DataTree objects, optionally including their metadata.
Formatting
All files should be formatted with ruff format. Lines containing text fields, including docstrings, should be between 80 and 88 characters in length. Imports of functions should be absolute, that is, they should include the yadg. prefix.
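To illustrate the absolute-import rule, the following minimal sketch uses the standard library ast module to list relative imports in a piece of source code. The function name find_relative_imports is hypothetical, and this check is not part of the project's tooling; it only shows what distinguishes the two import styles.

```python
import ast


def find_relative_imports(source: str) -> list:
    """Return the line numbers of relative imports such as ``from . import x``."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        # ImportFrom.level counts the leading dots; 0 means an absolute import.
        if isinstance(node, ast.ImportFrom) and node.level > 0
    ]


sample = (
    "from yadg import extractors\n"  # absolute import: preferred
    "from . import extractors\n"     # relative import: should be avoided
)
print(find_relative_imports(sample))
```

Only the second line of the sample is reported, since it is the only import that does not spell out the yadg. prefix.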
Implementing new features
New extractors should be implemented by:
adding their schema into dgbowl_schemas.yadg.dataschema.DataSchema;
adding their implementation in a separate Python package under yadg.extractors.
Each extractor should be documented by adding a structured docstring at the top of the file. This documentation should describe the application and usage of the extractor, and refer to the Pydantic autodocs via dataschema to discuss the features exposed via the parameters dictionary. If the extracted filetype is binary, a description of the file structure should be provided in the docstring. Every new filetype also has to be added to the filetype module.