Industry Work
At Eleusis, I work with the Health Solutions team, providing research insight to answer
the most challenging questions surrounding care delivery with a unique class of drugs: psychedelics.
We work to solve issues surrounding patient discharge, predict outcomes for subsets of patients, and screen
for potential contraindications to these emerging therapies.
Additionally, I recruited a team and developed a data integration pipeline to generate data science insights
that would support business and research decision making. This involved integrating
HL7, CDISC ODM XML, and other data structures into relational databases.
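To give a rough sense of the kind of transformation involved, here is a minimal sketch that flattens a hypothetical HL7 v2 message into rows ready for a relational table; the message content and column names are illustrative only, not taken from the actual pipeline.

raw_message = (
    "MSH|^~\\&|LAB|SITE01|EHR|SITE01|20230101120000||ORU^R01|MSG0001|P|2.5\r"
    "PID|1||SUBJ-042||DOE^JANE\r"
    "OBX|1|NM|HR^Heart Rate||72|bpm\r"
)

rows = []
# HL7 v2 segments are separated by carriage returns; fields are pipe-delimited.
for segment in raw_message.strip("\r").split("\r"):
    fields = segment.split("|")
    if fields[0] == "OBX":
        rows.append({
            "observation_id": fields[1],   # set ID within the message
            "test_code": fields[3],        # observation identifier
            "value": fields[5],            # observation value
            "units": fields[6],            # units of measure
        })

print(rows)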
I developed proficiencies in PySpark, in building schemas for structured data (JSON, YAML, Pandera),
and in reporting to non-technical audiences using dashboards and reporting tools (matplotlib, Plotly, Grafana).
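As a sketch of what schema-first loading looks like in PySpark (the column names and file path below are hypothetical, chosen to mirror the Pandera example further down):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-example").getOrCreate()

# Declare the expected structure up front instead of relying on schema inference.
event_schema = StructType([
    StructField("trial_id", StringType(), nullable=False),
    StructField("subject_id", StringType(), nullable=False),
    StructField("event_date", TimestampType(), nullable=False),
])

# Records that do not match the declared schema surface early as errors or nulls,
# rather than propagating silently downstream.
events = spark.read.schema(event_schema).json("events/*.json")
events.printSchema()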
This was a unique opportunity to work with third-party consultants to evaluate the available
tools for building these types of data platforms.
I helped pioneer the use of Pandera to validate data passed through this pipeline. Anyone working with
large datasets knows that transformations can be fraught with errors. Validation tools like Pandera (Python)
allow data to be validated within the same system in which it is being modified.
Just as YAML schemas can be used to validate data structures, Pandera provides a Python framework for defining data structures.
You can easily design and import schemas into various transformation scripts and test the inputs and outputs against your specifications.
Below you can find an example of the type of flexibility that new tools like Pandera can offer.
import pandera as pa
from pandera import Column, DataFrameSchema

# define input schema in pandera format (what the input should look like)
in_schema = DataFrameSchema(
    columns={
        'trial_id': Column(str, nullable=False),             # trial name
        'subject_id': Column(str, nullable=False),           # subject ID number
        'event_date': Column(pa.DateTime, nullable=False),   # event date
    }
)
# define output schema in pandera format (what the output should look like)
out_schema = DataFrameSchema(
    columns={
        'trial_id': Column(str, nullable=False),             # trial name
        'subject_id': Column(str, nullable=False),           # subject ID number
        'event_date': Column(pa.DateTime, nullable=False),   # event date
        'favorite_color': Column(str, nullable=False),       # column added by the processing step
    }
)
These schemas can then be wrapped around any Python function using Pandera's check decorators. The input
and output data are validated upon execution, allowing for a detailed, automatic review of validation errors.
They can be used to stop an automated data-processing pipeline and prevent poor data from slipping into your database. This type of automation allows many types of data to be integrated without manual verification, which is vital when collecting data at a high frequency, and one of the many reasons I chose to work with this framework.
import pandas as pd

# Validate the freshly loaded data against in_schema as it enters the pipeline.
@pa.check_output(in_schema)
def load_data(path):
    return pd.read_csv(path, parse_dates=['event_date'])

# Validate the incoming frame against in_schema and the result against out_schema.
@pa.check_input(in_schema)
@pa.check_output(out_schema)
def processing_fn(df):
    return df.assign(favorite_color="Crimson")
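A minimal usage sketch, assuming a hypothetical events.csv on disk, showing how a validation failure halts the pipeline rather than letting bad data through:

from pandera.errors import SchemaError

try:
    df = load_data("events.csv")      # hypothetical input file
    result = processing_fn(df)
except SchemaError as err:
    # A failed check stops the pipeline here instead of writing bad rows downstream.
    print(f"Validation failed: {err}")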
For more examples of my work with data pipelines in Python, you can check the "Academic" section of this
site to see open-source examples of my code.