wiki:DataSecurity

nearby: KeyGoalTracking, DataRepositoryManagement, ModelMashUp, SoftwareDev, DevTeams

Data Sharing for Breast Cancer Survey

moved to BreastCancerDataSharing, including WP1, WP2, WP3, WP4

De-identification Plan

brainstorming on how to arrange work for the following milestone:

  • #110 3.15 Approved approaches to securely de-identify and re-identify patient data are implemented (RC11); Connolly/Tachinardi; 5/1/14 to 6/15/14
    1. Sites implement approach: 5/1/14 to 5/31/14
    2. Conduct test with sites using real data: 5/21/14 to 6/15/14

Reviewing our approach (ticket:73#comment:8 Apr 14 to PCORI; no substantive changes since March 10 draft ticket:73#comment:3), it seems to have the following obligations outstanding:

  • A second technical document describing exactly how we apply this approach will also be developed.
  • Zip codes will be truncated to 3 digits, except in locations with 20,000 or fewer people, where they will be changed to 000.*
    • HERON ETL from DevTeams#kumc strips all zip codes in favor of state, categories of distance from KUMC, and a handful of school zones.
  • Dates - do all sites have our date shifting approach (#26) implemented?
  • We will create an instance_mapping table to de-identify/hash this value.
    • We're working on this for HERON. How about other sites?
  • The patient_mapping, encounter_mapping, and instance_mapping tables should be maintained between data loads so that a patient or encounter set created on one data load will work on the next data load.
    • Oops. I missed that in reviewing on behalf of DevTeams#kumc. The KUMC HERON team only preserves mappings for 3 months, lest the mapped identifiers become long-term patient identifiers.
  • ... we will explore automated testing to look for potential identifiers in the de-identified database. For example, looking for possible names, phone numbers, MRNs, etc. in the terminology, nval_num, and tval_char fields of i2b2.
  • ... we will remove identifiable information of providers from the de-identified dataset.
    • HERON doesn't support provider identifiers yet (except by way of REDCap databases)
  • We recommend that a researcher who needs identified data seek IRB approval; upon approval, we will re-run a given query taken from the de-identified database on an identified data set.
    • We're working on this for HERON. We don't have a complete plan; I expect it will take several months to achieve.
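
Two of the obligations above, zip code truncation and automated scanning for residual identifiers, can be sketched in Python as follows. This is a minimal illustration, not HERON code: the `ZIP3_POPULATION` lookup, its values, the function names, and the regular expressions are all hypothetical. A real implementation would need census population counts per 3-digit zip area and site-specific MRN and name patterns.

```python
import re

# Hypothetical population counts per 3-digit zip prefix (illustrative values only).
ZIP3_POPULATION = {"661": 450_000, "036": 12_000}

SMALL_AREA_THRESHOLD = 20_000


def truncate_zip(zip_code, populations=ZIP3_POPULATION):
    """Return the 3-digit prefix, or "000" for areas with 20,000 or fewer people.

    Prefixes missing from the lookup are treated as small areas (an
    assumption made here to fail safe), as are values whose first three
    characters are not all digits.
    """
    prefix = (zip_code or "").strip()[:3]
    if len(prefix) < 3 or not prefix.isdigit():
        return "000"
    if populations.get(prefix, 0) <= SMALL_AREA_THRESHOLD:
        return "000"
    return prefix


# Illustrative patterns only; MRN formats vary by site and are not covered here.
SUSPECT_PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_suspect_values(values):
    """Return (index, label) pairs for text values matching a suspect pattern."""
    hits = []
    for i, text in enumerate(values):
        for label, pattern in SUSPECT_PATTERNS.items():
            if text and pattern.search(text):
                hits.append((i, label))
    return hits
```

A scan like `find_suspect_values` could be run over a sample of tval_char values after each data load, with any hits reviewed by hand before release.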

Open, current priority data-sharing TracTickets:

Ticket | Type | Summary | Component | Owner | Reporter | Keywords | Blocking | Modified
#358 | task | NAACCR ETL: update all sites to include summary of treatment etc. | data-sharing | jay.pedersen | dconnolly | | | 6 months
#478 | enhancement | ADAPTABLE monitoring | data-sharing | schandaka | schandaka | | | 8 months
#545 | task | Study sample and DM sample definition for next-d | data-sharing | tmcneely | afurmanchuck | | #598 | 2 months
#565 | task | GPC PI-Connect Phase-II Data Request | data-sharing | schandaka | schandaka | PI-Connect | | 12 months
#573 | enhancement | de-identified text notes for on a cohort-by-cohort basis | data-sharing | gkowalski | dconnolly | | #431 | 2 weeks
#588 | enhancement | SNOW SHRINE node at UIOWA | data-sharing | nsmith | bgryzlak | | | 2 months
#600 | problem | SNOW SHRINE spoke at MU | data-sharing | tmcneeley | rwaitman | | | 3 months
#618 | task | collect WISC i2b2 repository copy for GROUSE | data-sharing | lpatel | dconnolly | | | 2 weeks
#619 | enhancement | GROUSE: integrate CDM data from GPC sites | data-sharing | rwaitman | rwaitman | GROUSE | | 2 weeks
#620 | enhancement | annual update of GROUSE i2b2 data from sites | data-sharing | rwaitman | rwaitman | | | 9 months
#631 | enhancement | NAACCR Tumor Registry query via GPS SNOW SHRINE | data-sharing | preeder | rwaitman | NAACCR breast-cancer | | 3 months
#644 | task | collect remaining i2b2 datamarts for GROUSE | data-sharing | lv | dconnolly | | | 7 days
#681 | enhancement | SNOW SHRINE spoke at WISC | data-sharing | mish | dconnolly | | | 3 months
#684 | enhancement | notes onto GROUSE for Mary’s project | data-sharing | rwaitman | dconnolly | | | 17 hours
#688 | problem | National Death data feed unavailable due to new site certification requirements | data-sharing | dconnolly | dconnolly | | | 3 weeks
#692 | task | Finder file refresh for GROUSE (UTHSCSA) | data-sharing | bokov | schandaka | GROUSE | | 4 days
#693 | task | Finder file refresh for GROUSE (UIowa) | data-sharing | nsmith | schandaka | GROUSE | | 4 days
#696 | task | Finder file refresh for GROUSE (MCRF) | data-sharing | lv | schandaka | GROUSE | | 4 days
#697 | task | Finder file refresh for GROUSE (MU) | data-sharing | mosaa@… | schandaka | GROUSE | | 17 hours
#698 | task | Finder file refresh for GROUSE (UTSW) | data-sharing | preeder | schandaka | GROUSE | | 4 days
#700 | task | Finder file refresh for GROUSE (IU) | data-sharing | dhood | schandaka | GROUSE | | 2 weeks
#701 | problem | UMN - Pull and submit tumor analysis dataset and CDM tables for cohort (2nd submission) | data-sharing | gweaver | bgryzlak | | | 7 days
#703 | problem | Final report to PCORI | data-sharing | bgryzlak | bgryzlak | | | 2 weeks

Review Criterion 11: Clear, thorough, and proven policies to maintain data security, patient privacy, and confidentiality, as well as organizational privacy.

RC11 from the GPC Proposal:

Providing access to health data for research while preserving privacy and security poses challenges at many levels, within sites and across sites. The GPC sites bring considerable experience for meeting these challenges at the site level. As a network we have plans for a balanced approach to multi-site data integration for the initial cohort characterization and terminology alignment work, as well as for enhancing methods for supporting Comparative Effectiveness Research (CER) trials. The HERON project at the University of Kansas Medical Center (KUMC) [1] represents several years of experience providing local researchers with interactive access to a de-identified clinical data repository. After an initial pilot of diagnoses and procedures of a few thousand clinic patients, we established a master data sharing agreement and governance committee for HERON and its use. Over the course of more than 25 consecutive monthly updates, the HERON data repository has grown to over a billion facts integrating the hospital EHR data with a tumor registry, the Social Security Death Index, and other data sources. After initially deploying support for only counting queries, the KUMC Data Request Oversight Committee (DROC) evaluated the risk/reward balance of granting users system access to i2b2 “analysis tool” plug-ins, which provide interactive views of line-item data. We developed rgate [2,3], a gateway from i2b2 plugins to the R statistical package, and used it as a basis for interactive survival analysis. Adoption of HERON has blossomed with these tools; for example, in August 2013, 1,157 queries were executed by 52 distinct users.

KUMC next developed web-based auditing methods to assist the healthcare system and honest broker in reviewing queries to manage the re-identification risk, to easily review concepts used for business sensitivities, and to confirm that the hypothesis generation for a requested study is in the investigator’s area of expertise. Through these activities, we have developed mature tools and secure, efficient processes for fulfilling data requests (encompassing de-identified data, identified data, or contact information for trial recruitment) after review and approval by the DROC. To facilitate rich data analysis while keeping datasets on the secure server, we also developed an R DataBuilder plugin [4,5] that integrates with R Studio Server [6], an integrated development environment (IDE) for R. KUMC has an established contract with Amazon Web Services (AWS) to facilitate sharing scalable software services. Amazon recently announced support for HIPAA business associate agreements (BAAs), and KUMC Medical Informatics is working with general counsel to finalize an acceptable BAA prior to January 2014. Given all these experiences, we are well equipped to facilitate research methods while maintaining privacy and security at the network and national level.

All GPC sites are experienced with maintaining HIPAA compliant data repositories and systems to collect data for prospective trials. Informatics leads at each site are responsible for working with their IRB, their healthcare system compliance/privacy officers, and their affiliated IT teams on issues of data security and privacy (both patient privacy and organizational privacy). The KUMC team has already consulted with several sites and will continue to do so throughout the project.

The choice of i2b2 as a data repository platform suggests using SHRINE [7] for multi-site integration. SHRINE, however, emphasizes queries that are fully automated, gives simple counts as results, and requires fully automated terminology alignment. Creating a high-functioning PCORI network across the GPC requires queries with rich data sets as results. We also recognize that aligning healthcare systems’ EHRs with Meaningful Use is a process unfolding during the study period, so terminology alignment requires an incremental approach involving a feedback loop of conducting queries and adjusting terminology mappings. Also, in order to reduce concerns from healthcare systems as we build trust, we believe the added step of mediation by honest brokers, rather than full automation during this initial project period, is the appropriate way to move forward.

Data integration between GPC sites takes two main forms: (1) terminology alignment and (2) data aggregation and analysis. To facilitate terminology alignment, we plan to support a shared i2b2 installation that supports terminology browsing but has no data for querying (#1). For aggregate analysis, we plan to build on R DataBuilder to integrate i2b2 with R Studio Server, a web-hosted integrated development environment for the R statistical package.

Since the terminology alignment reference i2b2 service has no patient data, hosting constraints are minimal with respect to HIPAA and Human Subjects Protection. The main security concerns for this service are to address potential healthcare system business sensitivities related to patient care volumes and to not disclose research in progress. We plan to address these concerns by limiting access to authorized users in the GPC community after approval by the site-level and GPC-level DROCs (#93). All GPC sites are capable of hosting the terminology reference i2b2 service. Initially, KUMC will integrate i2b2 with the enterprise directory, as in HERON (more likely we'll skip to federated login below). KUMC has existing infrastructure for setting up affiliate accounts, which takes anywhere from a few hours to a few days; that is quite reasonable for the initial phase, when the user base numbers a few dozen. As the GPC expands this service to a wider user community, we will deploy services to cloud hosting at Amazon (e.g. babel #1) and develop an independent federated login system such as Shibboleth [8] or InCommon [9] for authorization and authentication (#93, #174).

Balancing research objectives with patient privacy requirements is more involved for aggregation and analysis of patient data. We will aggregate only de-identified data and use a more mediated, rather than automated, approach to federated queries. When exchanging de-identified data, we exclude not only the 18 identifiers mandated by HIPAA, but all free text. Since the objective is to create standardized, structured data that is actionable within widely deployed EHRs, we see no requirement for the GPC to exchange or maintain free text data. This also mitigates PHI disclosure, re-identification risks, and business sensitivities. An individual site may, at its discretion, use text mining techniques to extract data from free text, but only the resulting structured data will be sent to the GPC. We will ensure fully de-identified processes are deployed across all sites for structured data. To address HIPAA requirements that dates not be reported at finer resolution than a year while preserving many temporal aspects of cases critical to research, each time we update the de-identified data repository from source systems, we shift dates on a per-patient basis randomly between 0 and 365 days. Access to the resulting fully de-identified i2b2 repository has been determined by the KUMC IRB to be non-human subjects research. Oversight, however, is provided by the DROC to review healthcare system concerns regarding business sensitivities.
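
The per-patient date shift described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the HERON ETL itself: the function names and the `(patient_id, date, concept)` row layout are hypothetical, and shifting dates backward (rather than forward) is an assumption, since the text only says dates are shifted by a random 0 to 365 days per patient. Because every fact for a patient moves by the same offset, intervals between that patient's events are preserved.

```python
import random
from datetime import date, timedelta


def assign_offsets(patient_ids, rng=None):
    """Draw one offset of 0 to 365 days per patient for this data load."""
    if rng is None:
        rng = random.SystemRandom()  # OS-level randomness; not reproducible
    return {pid: rng.randint(0, 365) for pid in patient_ids}


def shift_date(actual, offset_days):
    """Shift a date back by the patient's offset (the direction is an assumption)."""
    return actual - timedelta(days=offset_days)


def deidentify_facts(facts, offsets):
    """Apply each patient's offset to (patient_id, date, concept) rows."""
    return [
        (pid, shift_date(d, offsets[pid]), concept)
        for pid, d, concept in facts
    ]
```

The offsets dictionary plays the role of the per-patient offset table: it stays with the honest broker and is never part of the de-identified output.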

The process of federated query begins with approval by the GPC DROC of a query with respect to the shared GPC terminology. The GPC DROC contacts the DROC at each site where the query is to be executed for local approval as outlined in RC5. The honest brokers at those sites run the query at their sites and upload the results to a GPC data store.

The design for the GPC data store consists of a REDCap service (#159) and an R Studio service. In addition to the core case-report-form features of REDCap needed for prototyping patient-reported outcome instruments, REDCap provides a simple project-based workflow and access control model that is well suited to the access patterns required by the GPC. For example, to submit queries from the GPC DROC to honest brokers at each site, we will add all of the honest brokers as users in a REDCap project that has a query submission survey. On receipt of a query via this survey, each honest broker executes the query against their i2b2 installation and saves the results to an R data file using the R DataBuilder (#202). We will use another REDCap project to collect and distribute the results of the query. The honest brokers upload their R data files as file attachments to this project. The REDCap service will use secure HTTP (TLS/SSL) so that the file transfer is encrypted. The GPC honest broker then releases the collection of data files to the investigators by arranging for them to have access via R Studio Server (#213). The R statistical package provides a large toolset for combining de-identified data files for analysis, and the R Studio Server allows researchers to use this toolset while the data remain in the GPC data store.

While fully de-identified data sets suffice for cohort characterization and terminology alignment work, we anticipate that the Innovative Research Methodology Core (IRMC) team (see RC9) will require limited datasets that provide actual dates for monitoring prospective CER trials. We will develop a method for the honest broker to re-combine de-identified R DataBuilder files with the per-patient random date offset table (from the de-identification process) to restore the actual dates (#110). We also will work with the IRMC to develop statistical methods to evaluate re-identification risk in fully de-identified and limited datasets and collaborate with the emerging national network to implement best practices.
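
Re-combining de-identified rows with the per-patient offset table might look like the Python sketch below. It is illustrative only: the function name and row layout are hypothetical, plain tuples stand in for the contents of an R DataBuilder file, and it assumes dates were shifted back by the offset at de-identification time, so adding the offset recovers the actual date.

```python
from datetime import date, timedelta


def restore_dates(shifted_rows, offset_table):
    """Re-combine shifted rows with the per-patient offset table.

    Assumes each date was shifted back by the patient's offset (in days)
    during de-identification. A patient missing from the offset table
    raises KeyError, so no row silently keeps a shifted date.
    """
    return [
        (pid, shifted + timedelta(days=offset_table[pid]), concept)
        for pid, shifted, concept in shifted_rows
    ]
```

Raising an error on a missing offset is a deliberate choice in this sketch: a limited dataset that mixes actual and shifted dates would be hard to detect downstream.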

The GPC data store complies with KUMC policies and procedures for installing, operating, and monitoring public-facing web services that store protected health information (PHI). For example, a system cannot be placed into production use with PHI until the KUMC Information Security Officer approves a HIPAA certification checklist and formal risk assessment report. Currently, i2b2 and the R Studio Server operate within KUMC’s enterprise firewall. Anticipating the need for scalable data analysis and machine learning used for translational informatics [10], we will implement KUMC information security best practices in the Amazon Web Services (AWS) environment. Access will be limited to the IRMC team and approved GPC co-investigators, using KUMC’s virtual private network (VPN) or Amazon’s Virtual Private Cloud, monitored by KUMC Information Security. These systems will be readily applied across the GPC and, importantly, will inform national collaborations.

  1. Waitman LR, Warren JJ, Manos EL, Connolly DW. Expressing Observations from Electronic Medical Record Flowsheets in an i2b2 based Clinical Data Repository to Support Research and Quality Improvement. AMIA. Annual Symposium Proceedings; 2011, 1454–1463.
  2. Connolly DW, Adagarla B, Keighley J, Waitman LR. Integrating R efficiently to allow secure, interactive analysis within a clinical data warehouse. 8th Int. UseR Conf. (2012). http://biostat.mc.vanderbilt.edu/wiki/pub/Main/UseR-2012/141-Connolly.pdf Last accessed September 25, 2013.
  3. Connolly D. rgate -- gateway between i2b2 plugins and R. (University of Kansas Medical Center, 2012). https://informatics.kumc.edu/work/wiki/HeronStatsPlugins Last accessed September 25, 2013.
  4. Connolly DW, Waitman LR. Extending an I2B2-based Clinical Data Repository with the R Statistical Platform. 3rd Annual I2b2 Acad. User Group Conf. NLP Work; 2013.
  5. Connolly D. R Data Builder. https://informatics.kumc.edu/work/wiki/HeronStatsPlugins Last accessed September 25, 2013.
  6. Racine JS. RStudio: A Platform-Independent IDE for R and Sweave. J. Appl. Econ. 2013;27:167–172.
  7. McMurry AJ, Murphy SN, MacFadden D, Weber G, Simons WW, Orechia J, Bickel J, Wattanasin N, Gilbert C, Trevvett P, Churchill S, Kohane IS. SHRINE: enabling nationally scalable multi-site disease studies. PLoS One. 2013;8(3):e55811. Epub 2013 Mar
  8. Shibboleth. http://shibboleth.net/ Last accessed September 25, 2013.
  9. InCommon. https://incommon.org/ Last accessed September 25, 2013.
  10. Szalma S, Koka V, Khasanova T, Perakslis ED. Effective knowledge management in translational medicine. J Transl Med. 2010 Jul 19;8:68.

Related Work

To what extent could/should we leverage synapse?

Last modified on Dec 15, 2015 11:10:59 PM
