wiki:MultiSiteDev

The GPC welcomes community participation.

See also collaborative development efforts in other PCORNet CDRNs and PPRNs.

HERON ETL Development Process

The KUMC dev team publishes heron_load, a delayed snapshot of most of the ETL code for HERON; it is not, however, the version control repository actually used for development.

The following process is emerging to facilitate more direct collaboration among GPC sites:

  1. The HERON integrators are currently Dan Connolly and Lav Patel (see DevTeams#KUMC for contact info); they manage https://github.com/kumc-bmi (since #463).
  2. To participate in shared development, send your github username to the HERON integrators, who will give you read (clone / fetch / pull) access to the HERON repository in https://github.com/kumc-bmi .
  3. GPC peers should use pull requests to contribute changes.

ETL Architecture Brainstorm

In early February 2016, the KUMC team looked back at its HERON ETL experience and brainstormed alternatives and refinements:

Concrete outcomes include:

  1. using GROUSE ETL as an opportunity to start fresh
    • using Luigi in place of paver, for its support for running tasks in parallel (see the sketch below)
  2. aiming to flip/flop within one Oracle database instance, e.g. by using aliases and renaming tables; the exact mechanism is still being worked out (KUMC:ticket:3871)
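
For reference, a minimal sketch of the Luigi style of task dependency (the task, parameter, and file names here are made up for illustration, not taken from GROUSE ETL):

  import luigi

  class TransformSource(luigi.Task):
      """Hypothetical first step: transform staged source data."""
      source = luigi.Parameter(default='naaccr')

      def output(self):
          return luigi.LocalTarget('transform_%s.done' % self.source)

      def run(self):
          with self.output().open('w') as done:
              done.write('ok')

  class LoadFacts(luigi.Task):
      """Hypothetical fact load; requires() is how Luigi models dependencies."""
      source = luigi.Parameter(default='naaccr')

      def requires(self):
          return TransformSource(source=self.source)

      def output(self):
          return luigi.LocalTarget('facts_%s.done' % self.source)

      def run(self):
          with self.output().open('w') as done:
              done.write('ok')

  if __name__ == '__main__':
      # workers=4 lets independent tasks run in parallel
      luigi.build([LoadFacts()], workers=4, local_scheduler=True)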

etl-dev tickets

query:component=etl-dev

Each code branch should correspond to either an enhancement or a defect ticket. Diagnosing a problem ticket might turn it into a defect or enhancement (or spawn related defect or enhancement tickets); likewise, resolving a design issue ticket or completing a task ticket may do the same.

To the extent feasible, start feature branches at the most recent merge-release-to-default commit. That way, other sites can pick and choose which new features to adopt between releases.

Code Sharing Goals and Principles

Application of quality software engineering principles promotes sharing of development effort within the GPC:

  • Thorough documentation and testing helps manage common risks:
    • I can't tell which end is up; it's easier to build my own than to fix or enhance that code.
    • I don't want to change it because I'm afraid I'll break it.
  • Modularity is its own reward:
    • ETL modules should be self-contained, having everything needed to run except the raw data (which can be replaced with self-contained demo data for testing purposes).
    • Within an ETL module, related functionality should be grouped within the same file or section of a file, to make it more obvious where to look for bugs.
  • Principle of least surprise:
    • At every level, the ETL code should follow the principle of least surprise as closely as possible, so that the experience of one site's developers is as transferable as possible to other sites.

Toward a Code Review Checklist

adapted from KUMC:CodeReviewNotes

  • Is the use-case clear? (cf. StoryTellingAndTestCases, KUMC:HeronLoad#ETLDevelopment)
  • Is the use-case reasonably well captured in automated tests?
    • Is it clear how to run the tests, i.e. which end of the code is up?
      • e.g. with a README
      • Are code dependencies documented?
        • cf. KUMC:WritingQualityCode#pypy
    • Note: tests should tell a story; don't use names like foo and bar.
  • Is the code something you would be happy to maintain? (cf. KUMC:WritingQualityCode)
    • Is the code DRY?
    • Is unit test coverage adequate?
    • Could we expect our peers in the informatics community to use this code?
      • Note: references to tickets from code are OK, but only in a supplementary role; they should not be essential to understanding the purpose or design of the code.
  • Is the code secure? Is the flow of authority clear?
  • Are external design constraints clear?
    • For example, is it clear which names were made up and which come from elsewhere?
    • Are sources cited?
  • Is installation and deployment reasonably well documented?
    • Can you reproduce the results, i.e. can you run integration tests?
    • Is logging sufficient to diagnose problems?
      • production logging level should be INFO or above; use log.debug() only for debugging (see the sketch after this checklist)
  • Is the copyright notice clear and correct?
    • If we're using anyone else's work, is it clearly licensed and acknowledged?
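
For the logging item above, a minimal sketch of the intended split between levels (the function and messages are made up):

  import logging

  log = logging.getLogger(__name__)

  def load_facts(rows):
      log.info('loading %d facts', len(rows))  # routine progress: INFO
      for row in rows:
          log.debug('fact detail: %s', row)    # verbose detail: DEBUG only

  if __name__ == '__main__':
      logging.basicConfig(level=logging.INFO)  # production level: INFO or above
      load_facts([{'concept_cd': 'DEM|SEX:f'}])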

Documentation, Unit Testing

The python doctest module makes unit testing and documentation convenient and rewarding. As explained in StoryTellingAndTestCases, tests should tell a story; don't use names like 'foo' and 'bar'.
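For instance, a story-telling doctest might look like this (the function is a made-up illustration, not from heron_load):

  def age_in_years(born, as_of):
      """Compute a patient's age, e.g. for PATIENT_DIMENSION.

      A patient born in May 1980 is 37 at a June 2017 visit:

      >>> from datetime import date
      >>> age_in_years(born=date(1980, 5, 1), as_of=date(2017, 6, 2))
      37
      """
      before_birthday = (as_of.month, as_of.day) < (born.month, born.day)
      return as_of.year - born.year - before_birthday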

Using Sphinx markup conventions results in nice python API documentation.
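For example, Sphinx info field lists document parameters and return values (the signature below is hypothetical):

  def load_facts(source_cd, upload_id):
      """Load one source's staged rows into OBSERVATION_FACT.

      :param str source_cd: code for the data source, e.g. 'naaccr'
      :param int upload_id: key into the UPLOAD_STATUS table
      :returns: the number of facts loaded
      :rtype: int
      """
      raise NotImplementedError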

For .sql as well:

  • the 1st line is a module header stating purpose/scope
  • then a short copyright and license blurb
  • cite sources; sketch the design

Tasks and dependencies

The HERON ETL code is mostly SQL, but it's wrapped in paver scripts. Paver is a python library that provides task dependency management, somewhat like make.
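
A minimal sketch of the paver style (task names here are made up):

  # pavement.py
  from paver.easy import needs, task

  @task
  def stage_source():
      """Pull raw rows from the source system into staging tables."""

  @task
  @needs('stage_source')
  def load_facts():
      """Populate OBSERVATION_FACT; paver runs stage_source first."""

Running paver load_facts runs stage_source and then load_facts, much as make would.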

For details, see the HERON ETL presentation from HackathonOne, as well as #71 and #128.

Toward Modular ETL

To get the data into i2b2, the process we've used so far in HERON is:

  1. transform source data (preferably using views? #112)
  2. populate dimensions
    • PATIENT_DIMENSION, VISIT_DIMENSION, PATIENT_MAPPING, and ENCOUNTER_MAPPING (and, if needed, PROVIDER_DIMENSION)
  3. load facts
    • OBSERVATION_FACT
    • Note the use of the UPLOAD_STATUS table
  4. load concepts/terms
    • Use separate tables for each terminology and then combine into HERON_TERMS for counts.
      • CONCEPT_DIMENSION is populated from HERON_TERMS
    • We're not sure how CODE_LOOKUP is supposed to be used; we haven't run into a need for it yet.

This assumes the data is staged. More on that below.
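
In outline (a sketch only; these function names are hypothetical, and the real work lives in the SQL scripts):

  def transform_source(source_cd):
      """1. Transform staged source data, preferably via views (#112)."""

  def populate_dimensions(source_cd):
      """2. Fill PATIENT_DIMENSION, VISIT_DIMENSION, and the mapping tables."""

  def load_facts(source_cd):
      """3. Fill OBSERVATION_FACT, tracking the batch in UPLOAD_STATUS."""

  def load_terms(source_cd):
      """4. Build per-terminology tables, combine into HERON_TERMS,
      then populate CONCEPT_DIMENSION."""

  def load_source(source_cd):
      # the steps above, in order
      for step in (transform_source, populate_dimensions,
                   load_facts, load_terms):
          step(source_cd)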

File layout and naming

Most of the python and SQL code is in the heron_load directory, with names such as naaccr_facts_load.sql and naaccr_concepts_load.sql so that code for the same data source sorts together. Its heron_staging peer has a subdirectory for each source, and heron_load/curated_data holds mostly .csv files for terminologies.

The heron_load directory is getting unwieldy; we're considering a separate directory for each data source, for example ntds, with subdirectories for staging, load, and curated_data (see the sketch below). But moving files is expensive (e.g. web links break), so we're likely to do this only for new code.
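
Such a layout might look like (ntds here is just illustrative):

  ntds/
    staging/        pull from the source system
    load/           ntds_facts_load.sql, ntds_concepts_load.sql, ...
    curated_data/   terminology .csv files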

  • heron_load/heron_init.py collects tasks that create schemas and hence require unusual DB privileges.
  • heron_init.py also populates some lookup/metadata tables to avoid circular dependencies.
  • epic_etl.py, idx_etl.py, etc. transform and load data.
  • Question: does paver import only the files specified in pavement.py? If so, does that mean that if an ETL module lives in its own subdirectory, pavement.py needs to be updated so that the new scripts will be recognized?
    • That's one option. A couple others:
      1. Use paver -f nifty_tasks.py nifty_task1 rather than referencing pavement.py implicitly.
      2. Put a pavement.py in the ETL module subdirectory and use that.
        • Factor out any parts of heron_load/pavement.py that you want to re-use (see the sketch below).
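
A minimal sketch of option 2 (all names here are made up):

  # ntds/pavement.py
  from paver.easy import task

  # Re-use parts factored out of heron_load/pavement.py, e.g.:
  #   from heron_pavement_common import oracle_connect

  @task
  def load_ntds():
      """Invoke with: paver -f ntds/pavement.py load_ntds"""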

Note some particularly relevant tickets:

  • #38: Support service name as well as SID in configuration file for ETL code shared by KUMC
  • #46: establish gpc-dev text chat, regular discussion
  • #56: shareable synthetic test data sets: Epic clarity, i2b2, NAACCR, ...
  • #125: Allow skipping tasks in paver config
