Enforce pre-defined validation constraints¶
In a previous guide, you defined validation constraints ad-hoc when initializing Curator
objects.
Often, you want to enforce a pre-defined set of validation constraints, like, e.g., the CELLxGENE curator (Curate AnnData based on the CELLxGENE schema).
This guide shows how to subclass Curator
to enforce pre-defined constraints.
Define a custom curator¶
Consider the example of electronic health records (EHR). We want to ensure that
every record has the fields
disease
,phenotype
,developmental_stage
, andage
values for these fields map against specific versions of pre-defined ontologies
The following implementation achieves the goal by subclassing DataFrameCurator
.
import bionty as bt
import pandas as pd
from lamindb.core import DataFrameCurator, Record, logger
from lamindb.core.types import UPathStr, FieldAttr
__version__ = "0.1.0"
class EHRCurator(DataFrameCurator):
"""Custom curation flow for electronic health record data."""
def __init__(self, data: pd.DataFrame | UPathStr):
# Curate these columns against the specified fields
DEFAULT_CATEGORICALS = {
"disease": bt.Disease.name,
"phenotype": bt.Phenotype.name,
"developmental_stage": bt.DevelopmentalStage.name,
}
# If columns or values are missing, we substitute with these defaults
DEFAULT_VALUES = {
"disease": "normal",
"development_stage": "unknown",
"phenotype": "unknown",
}
# Validate values onto the following ontology versions
DEFAULT_SOURCES = {
"disease": bt.Source.get(
entity="bionty.Disease", name="mondo", version="2023-04-04"
),
"developmental_stage": bt.Source.get(
entity="bionty.DevelopmentalStage", name="hsapdv", version="2020-03-10"
),
"phenotype": bt.Source.get(
entity="bionty.Phenotype",
name="hp",
version="2023-06-17",
organism="human",
),
}
self.data = data
for col, default in DEFAULT_VALUES.items():
if col not in self.data.columns:
self.data[col] = default
else:
self.data[col].fillna(default, inplace=True)
super().__init__(
df=self.data,
categoricals=DEFAULT_CATEGORICALS,
sources=DEFAULT_SOURCES,
organism="human",
)
def validate(self, organism: str | None = None) -> bool:
"""Validates the internal EHR standard."""
missing_columns = {"disease", "phenotype", "developmental_stage", "age"} - set(
self.data.columns
)
if missing_columns:
logger.error(
f"Columns {', '.join(map(repr, missing_columns))} are missing but required."
)
return False
return DataFrameCurator.validate(self, organism)
Use the custom curator¶
!lamin init --storage ./subclass-curator --schema bionty
→ connected lamindb: testuser1/subclass-curator
import lamindb as ln
import bionty as bt
import pandas as pd
from ehrcurator import EHRCurator
ln.track("2XEr2IA4n1w40000")
→ connected lamindb: testuser1/subclass-curator
/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/anndata/_io/__init__.py:12: FutureWarning: Importing read_zarr from `anndata._io` is deprecated. Please use anndata.io instead.
warnings.warn(
→ notebook imports: bionty==0.52.0 ehrcurator lamindb==0.76.15 pandas==2.2.3
→ created Transform('2XEr2IA4'), started new Run('d9kUiVPm') at 2024-11-11 10:29:43 UTC
# create example DataFrame that has all mandatory columns but one ('patient_age') is wrongly named
data = {
'disease': ['Alzheimer disease', 'Diabetes mellitus', 'Breast cancer', 'Hypertension', 'Asthma'],
'phenotype': ['Cognitive decline', 'Hyperglycemia', 'Tumor growth', 'Increased blood pressure', 'Airway inflammation'],
'developmental_stage': ['Adult', 'Adult', 'Adult', 'Adult', 'Child'],
'patient_age': [70, 55, 60, 65, 12],
}
df = pd.DataFrame(data)
df
Show code cell output
disease | phenotype | developmental_stage | patient_age | |
---|---|---|---|---|
0 | Alzheimer disease | Cognitive decline | Adult | 70 |
1 | Diabetes mellitus | Hyperglycemia | Adult | 55 |
2 | Breast cancer | Tumor growth | Adult | 60 |
3 | Hypertension | Increased blood pressure | Adult | 65 |
4 | Asthma | Airway inflammation | Child | 12 |
ehrcurator = EHRCurator(df)
ehrcurator.validate()
Show code cell output
✓ added 3 records with Feature.name for columns: 'disease', 'phenotype', 'developmental_stage'
✗ Columns 'age' are missing but required.
/home/runner/work/lamindb/lamindb/docs/ehrcurator.py:49: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
self.data[col].fillna(default, inplace=True)
/home/runner/work/lamindb/lamindb/docs/ehrcurator.py:49: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
self.data[col].fillna(default, inplace=True)
False
# Fix the name of wrongly spelled column
df.columns = df.columns.str.replace("patient_age", "age")
ehrcurator.validate()
Show code cell output
• saving validated records of 'disease'
✓ added 4 records from public with Disease.name for disease: 'Alzheimer disease', 'diabetes mellitus', 'asthma', 'breast cancer'
• saving validated records of 'phenotype'
✓ added 3 records from public with Phenotype.name for phenotype: 'Increased blood pressure', 'Hyperglycemia', 'Mental deterioration'
• mapping disease on Disease.name
! 1 term is not validated: 'Hypertension'
→ fix typo, remove non-existent value, or save term via .add_new_from('disease')
• mapping phenotype on Phenotype.name
! 2 terms are not validated: 'Tumor growth', 'Airway inflammation'
→ fix typos, remove non-existent values, or save terms via .add_new_from('phenotype')
• mapping developmental_stage on DevelopmentalStage.name
! 2 terms are not validated: 'Adult', 'Child'
→ fix typos, remove non-existent values, or save terms via .add_new_from('developmental_stage')
False
# Use lookup objects to curate the values
disease_lo = bt.Disease.public().lookup()
phenotype_lo = bt.Phenotype.public().lookup()
developmental_stage_lo = bt.DevelopmentalStage.public().lookup()
df["disease"] = df["disease"].replace({"Hypertension": disease_lo.hypertensive_disorder.name})
df["phenotype"] = df["phenotype"].replace({
"Tumor growth": phenotype_lo.neoplasm.name,
"Airway inflammation": phenotype_lo.bronchitis.name}
)
df["developmental_stage"] = df["developmental_stage"].replace({
"Adult": developmental_stage_lo.adolescent_stage.name,
"Child": developmental_stage_lo.child_stage.name
})
ehrcurator.validate()
Show code cell output
• saving validated records of 'disease'
• saving validated records of 'phenotype'
• saving validated records of 'developmental_stage'
✓ 'disease' is validated against Disease.name
✓ 'phenotype' is validated against Phenotype.name
✓ 'developmental_stage' is validated against DevelopmentalStage.name
True
Show code cell content
!rm -rf subclass-curator
!lamin delete --force subclass-curator
• deleting instance testuser1/subclass-curator