The GPC welcomes community participation:
- Progress toward technical milestones is documented in this project management web site for all to see.
- Our gpc-dev mailing list has an open archive and we welcome relevant posts from our peers in the wider community.
- Notes from our meetings are shared openly and we welcome guests with shared interests
- Notes from our weekly teleconferences go to gpc-dev, as do agendas the day before (#12).
- HackathonFour was Jan 2017 in Kansas City
- Our DevTeams share a variety of software projects and methods including
- HERON ETL from DevTeams#kumc - a standard set of scripts that use Python and Oracle PL/SQL and facilitates migration of transactional data into the vendor neutral i2b2 data warehouse. (presented at HackathonOne)
- SEINE DataBuilder from DevTeams#kumc is a method for fulfilling data extracts from i2b2
- DataFinisher from DevTeams#uthscsa post-processes DataBuilder output for analysis
- kumc-bmi/i2p-transform transforms an i2b2 data warehouse to a PCORNet CDM
- DevTeams#unmc shares their ETL methods and code to go from Epic clarity to i2b2 and then to PCORNet CDM
- MCW_BMI/unstructured-notes-deidentification from DevTeams#mcw provides a scalable pipeline for de-identification of patient records (presented at HackathonFour)
- UTHSCSA-CIRD/datafinisher helps find over and under represented EHR facts in an i2b2 patient-set.
See also collaborative development efforts in other PCORNet CDRNs and PPRNs.
HERON ETL Development Process
The KUMC dev team publishes heron_load, which is a delayed snapshot of most of the ETL code for HERON. But it's not the actual version control system used for development.
The following process is emerging to facilitate more direct collaboration among GPC sites:
- The HERON integrators are currently Dan Connolly and Lav Patel (see DevTeams#KUMC for contact info); they manage https://github.com/kumc-bmi (since #463).
- To participate in shared development, send your github username to the HERON integrators, who will give you read (clone / fetch / pull) access to the HERON repository in https://github.com/kumc-bmi .
- GPC peers should use pull requests to contribute changes.
ETL Architecture Brainstorm
In early Feb 2016, the KUMC team is looking back at HERON ETL experience and brainstorming alternatives and refinements:
Concrete outcomes include:
- using GROUSE ETL as an opportunity to start fresh
- using Luigi, with its support for running tasks in parallel in place of paver
- aiming to flip/flop within one Oracle database instance (by using aliases and renaming tables or something) (KUMC:ticket:3871)
etl-dev tickets
Each code branch should correspond to either an enhancement or defect ticket. Diagnosing a problem ticket might turn it into a defect or enhancement (or result in related defects or enhancements). Likewise resolving a design issue ticket or completing a task ticket.
To the extent feasible, start feature branches at the most recent merge-release-to-default commit. That way, other sites can pick and choose which new features to adopt between releases.
Code Sharing Goals and Principles
Application of quality software engineering principles promotes sharing of development effort within the GPC:
- Thorough documentation and testing helps manage common risks:
- I can't tell which end is up; it's easier to build my own than to fix or enhance that code.
- I don't want to change it because I'm afraid I'll break it.
- Modularity is its own reward:
- ETL modules should be self contained, having everything needed to run except the raw data (which can be replaced with self-contained demo data for testing purposes).
- Within an ETL module related functionality should be grouped within the same file or section of a file in order to make it more obvious where to look for bugs.
- Principle of least surprise:
- At every level, the ETLs should strive to violate the principle of least surprise as little as possible, so that the experience of one site's developers is as transferrable as possible to other sites.
Toward a Code Review Checklist
adapted from KUMC:CodeReviewNotes
- Is the use-case clear? (cf. StoryTellingAndTestCases, KUMC:HeronLoad#ETLDevelopment)
- Is the use-case reasonably well captured in automated tests?
- Is it clear how to run the tests? which end of the code is up?
- e.g. with a README
- Are code dependencies documented?
- cf. KUMC:WritingQualityCode#pypy
- Note: tests should tell a story; don't use names like foo and bar.
- Is it clear how to run the tests? which end of the code is up?
- Is the code something you would be happy to maintain? (cf. KUMC:WritingQualityCode)
- Is the code DRY?
- Is unit test coverage adequate?
- Could we expect our peers in the informatics community to use this code?
- Note: references to tickets from code are OK, but only in a supplementary role; they should not be essential to understanding the purpose or design of the code.
- Is the code secure? Is the flow of authority? clear?
- Are external design constraints clear?
- For example, is it clear which names were made up and which come from elsewhere?
- Are sources cited?
- Is installation, deployment reasonably well documented?
- Can you reproduce the results, i.e. can you run integration tests?
- Is logging sufficient to diagnose problems?
- production logging level should be INFO or above; use log.debug() only for debugging
- Is the copyright notice clear and correct?
- If we're using anyone else's work, is it clearly licensed and acknowledged?
Documentation, Unit Testing
The python doctest module makes unit testing and documentation convenient and rewarding. As explained in StoryTellingAndTestCases, tests should tell a story; don't use names like 'foo' and 'bar'.
Using Sphinx markup conventions results in nice python API documentation.
For .sql as well:
- 1st line is a module header that tells purpose/scope
- short copyright, license blurb
- cite sources; sketch design
Tasks and dependencies
The HERON ETL code is mostly SQL, but it's wrapped in paver scripts. Paver is a python library that provides task dependency management, somewhat like make.
For details, see the HERON ETL presentation from HackathonOne, as well as #71 and #128.
Toward Modular ETL
To get the data into i2b2, the process we've used so far in HERON is:
- transform source data (preferably using views? #112)
- populate dimensions
- PATIENT_DIMENSION, VISIT_DIMENSION, PATIENT_MAPPING, and ENCOUNTER_MAPPING (and, if needed, PROVIDER_DIMENSION)
- load facts
- OBSERVATION_FACT
- Note the use of UPLOAD_STATUS table
- load concepts/terms
- Use separate tables for each terminology and then combine into HERON_TERMS for counts.
- CONCEPT_DIMENSION is populated from HERON_TERMS
- Not sure how CODE_LOOKUP is supposed to be used; haven't run into a need for it yet
- Use separate tables for each terminology and then combine into HERON_TERMS for counts.
This assumes the data is staged. More on that below.
File layout and naming
Most of the python and SQL code is in the heron_load
directory with names such as naaccr_facts_load.sql
and naaccr_concepts_load.sql
so that code for the same data source sorts together; its heron_staging
peer has a subdirectory for each source, and the heron_load/curated_data
contains mostly .csv files for terminologies.
The heron_load
directory is getting unwieldy; we're considering separate directories for each data source; for example, ntds with subdirectories for staging
, load
, and curated_data
. But moving files is expensive (e.g. web links break) so we're likely to do this only for new code.
heron_load/heron_init.py
collects tasks that create schemas and hence require unusual DB privileges.heron_init.py
als populates some lookup/metadata table to avoid circular dependencies.epic_etl.py
,idx_etl.py
etc. transform and load data
- Question: are the only files that
paver
imports the ones specified inpavement.py
? If so, does this mean that if an ETL module is in its own subdirectory,pavement.py
needs to be updated so that the new scripts will be recognized?- That's one option. A couple others:
- Use
paver -f nifty_tasks.py nifty_task1
rather than referencingpavement.py
implicitly. - Put a
pavement.py
in the ETL module subdirectory and use that.- Factor out any parts of
heron_load/pavement.py
that you want to re-use.
- Factor out any parts of
- Use
- That's one option. A couple others:
Note some particularly relevant tickets:
Attachments (1)
-
IMG_20170203_120005784.jpg (709.7 KB) - added by 4 years ago.
HERON ETL architecture brainstorm
Download all attachments as: .zip