Managing inter-organizational research projects with FHIR-based architecture

By: Alex Baumberg & Amos (Kippi) Bordowitz

One of the most basic problems we run up against when using data from several organizations is the heterogeneity of data formats and of the means of accessing the data. In addition, data tends to become fragmented, even within a single organization. During a research project, this problem can become acute.

Before we delve into the problem, let us make sure we understand it. Think of a pregnant person; imagine all the procedures they must go through during their pregnancy and leading up to the delivery. They might find themselves having tests and procedures in a variety of medical centers, from their GP all the way to the hospital in which they will give birth. In today’s world, unfortunately, every organization maintains its data differently, meaning that data preparation for Federated Learning* models becomes a slow and costly process, as all data must be standardized prior to ingestion.

*According to Wikipedia, Federated Learning is “a machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging them”.

The problem is compounded when one wants to run a research project using data from several organizations. Research data must be precise, and all data of a certain type must be represented in exactly the same way; otherwise it might skew the final results, or make obtaining results considerably more difficult and time-consuming. This can lead to the loss of grant money and of many working hours. To add another level of complication, most medical research requires that the data be de-identified, which is harder still when the data isn’t standardized.

Here at Outburn, we have developed our own in-house solution for on-demand data de-identification, while simultaneously preparing the data for federated learning models.

In this article, we will outline our solution and its architecture.

Business workflow:

Our project, a collaboration with HUJI, is meant to assess the link between exposure to Cytomegalovirus (CMV) during pregnancy and the health of the newborn children post-partum.

Once the researchers have received the Helsinki committee guidelines for their project, the next step is to build a de-identification profile for the data, according to said guidelines. In this project we received a profile set for de-identification of data from pregnant women (including data such as the number of pregnancies and the number of pregnancies carried to term) and their newborns, along with several other data sources, such as medications taken by the mother and lab results.

Our de-identification repository now contains the created profile for use in future iterations of the research project, using different target populations. This allows us to work hard once and then reuse the products of our labor. We then built FHIR profiles that represent all the data required for the project and used them to populate a research FHIR server. The server can be repopulated as often as needed with new data according to researcher requirements.


De-identification process:

As seen in the diagram above, the researcher must be authenticated in order to make sure that only people who have received the proper clearance can start the de-identification process, as they will also be the recipients of the research results.

Once logged in, the user defines the research population. The research query is then transferred to the authentication unit (step 1) for approval by the data security manager (step 2). Next, the researcher initiates the de-identification service (step 3) which queries the FHIR server and retrieves the data pertinent to the specific research question (step 4). The service then retrieves the de-identification profile and applies it to the data (step 5). The products are then stored in a secured, dedicated data storage unit (step 6) which will be accessed by the federated learning machine (“FLM” – step 7).

This entire process is run in each of the organizations taking part in the research project. The FLM in each organization creates a “model-result”, which is then exchanged between the organizations. Thus, the federated machine learning technique trains an algorithm across multiple organizations through the exchange of “model-results” between the partner organizations participating in the project.


Technical Workflow:

Let’s take a deep dive into the diagram above:

1.     Data Source: Organizational sources of identified data. These are fed into the ESB (Enterprise Service Bus, a type of integration server). Each organization is responsible for transforming its native data into FHIR transactions, according to the FHIR data model specified for the particular research project. The data is then fed into the FHIR server.

2.     FHIR Server: Stores all the resources required for the project. The references between these resources make up the data model.

3.     FADE (FHIR Advanced Data Extract): Proprietary data preparation and de-identification tool. The following describes FADE’s workflow:

The “Client” component is the application the researcher uses to manage the entire data preparation and de-identification process. It is accessible through secure communication over an SSL VPN and requires authorization through the Loopback access control lists (ACL) component. The ACL component controls user credentials (roles) linked to specific research projects, so that only authorized researchers can initiate data preparation and de-identification. The client uses methods exposed by the REST services of the data preparation component and provides the researcher with the following capabilities:

    1. Define research population. The system allows the construction of complex queries with include/exclude rules. This gives maximum flexibility for building complex research group criteria that cannot be expressed in a single FHIR search query. An example of a complex query with include and exclude rules: “All female patients between 20 and 30 years old, including patients having given birth in the last 5 years, including pregnancies having CMV IgG Ab findings, excluding patients having gestational diabetes mellitus”.
    2. Process initiation and monitoring
    3. Refresh population. This method collects the patient population as currently stored on the FHIR server. This explicit list of patients will be used when the population is exported, allowing subsequent runs to re-use the exact same population.
    4. Get the number of patients used for the current run.
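
As an illustration of the first capability, a complex population definition can be modelled as an ordered list of include/exclude rules, each carrying one simple FHIR search. The sketch below is only an assumption about how such a definition might look; the class names, the `research_id` field, the dates and the placeholder codes are hypothetical and are not FADE’s actual API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Rule:
    action: str       # "include" or "exclude"
    description: str  # human-readable criterion
    fhir_search: str  # one simple FHIR search covering only this criterion


@dataclass
class PopulationDefinition:
    research_id: str  # hypothetical identifier of the research project
    rules: List[Rule] = field(default_factory=list)

    def include(self, description: str, fhir_search: str) -> "PopulationDefinition":
        self.rules.append(Rule("include", description, fhir_search))
        return self

    def exclude(self, description: str, fhir_search: str) -> "PopulationDefinition":
        self.rules.append(Rule("exclude", description, fhir_search))
        return self

    def to_json(self) -> str:
        # Serialized form that a client could send to a REST service.
        return json.dumps(asdict(self), indent=2)


# The example population from the text. Codes and dates are placeholders.
example = (
    PopulationDefinition("cmv-study")
    .include("Female patients aged 20 to 30",
             "Patient?gender=female&birthdate=ge1994-01-01&birthdate=le2004-12-31")
    .include("Gave birth in the last 5 years",
             "Procedure?code=PLACEHOLDER-DELIVERY&date=ge2019-01-01")
    .include("Pregnancies with CMV IgG Ab findings",
             "Observation?code=PLACEHOLDER-CMV-IGG")
    .exclude("Gestational diabetes mellitus",
             "Condition?code=PLACEHOLDER-GDM")
)

print(example.to_json())
```

Keeping each rule as a separate, simple search is what makes it possible to evaluate the rules one by one against a standard FHIR server.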


The data validation and approval component allows an organizational data security professional to review and approve the project’s population and the de-identification profile chosen for the project, as well as to assess the de-identified data after export.

The data preparation process retrieves the research population from the FHIR server, as defined by the researcher, using RESTful patient-level queries, and passes it on for de-identification. The entire data retrieval process is designed to export the target population data from the FHIR server according to the complex query defined by the researcher. The Bulk Export process, natively supported by the FHIR server, does not support complex query execution. To bypass this limitation, we developed and deployed our own in-house solution. The complex query is broken down into simple, atomic queries. For example, loosely based on what we’ve already discussed: first query, “All female patients between 20 and 30 years old”; second query, “gave birth in the last five years”. Each query populates a “Group” resource. We then merge the resulting sub-populations into a single, coherent population Group.
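
The merge step can be sketched as plain set operations on patient IDs. The article does not spell out the merge semantics, so the sketch assumes that sub-populations from include rules are intersected and those from exclude rules are subtracted; the function names and sample IDs are hypothetical, and the atomic queries are replaced by hard-coded ID sets rather than real FHIR server calls.

```python
from typing import Dict, Iterable, List, Set


def merge_populations(included: Iterable[Set[str]],
                      excluded: Iterable[Set[str]]) -> Set[str]:
    """Intersect all include sub-populations, then remove every excluded patient."""
    included = list(included)
    if not included:
        return set()
    merged = set(included[0]).intersection(*included[1:])
    for patient_ids in excluded:
        merged -= set(patient_ids)
    return merged


def to_group_resource(name: str, patient_ids: Iterable[str]) -> Dict:
    """Build a minimal FHIR Group resource listing the final population."""
    members: List[Dict] = [
        {"entity": {"reference": f"Patient/{pid}"}} for pid in sorted(patient_ids)
    ]
    return {
        "resourceType": "Group",
        "name": name,
        "type": "person",
        "actual": True,
        "member": members,
    }


# Stand-ins for the results of the atomic queries (one set of IDs per query).
women_20_to_30 = {"p1", "p2", "p3", "p4"}
gave_birth_recently = {"p2", "p3", "p4"}
cmv_igg_findings = {"p3", "p4", "p5"}
gestational_diabetes = {"p4"}

population = merge_populations(
    included=[women_20_to_30, gave_birth_recently, cmv_igg_findings],
    excluded=[gestational_diabetes],
)
group = to_group_resource("cmv-study-population", population)
```

Because the final Group holds an explicit list of patients, a later export can re-use exactly the same population.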

The de-identification profile repository keeps the de-identification profile (mask) that was approved by the organizational “Helsinki data committee” and is used for data de-identification. The repository can store multiple de-identification profiles per research project.

The de-identification process applies the de-identification mask to the identified data previously retrieved from the FHIR server. The result is then submitted for approval by the organization.
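
To make the idea of a mask concrete, here is a minimal sketch of applying a profile to a single FHIR resource. The profile format (a mapping from top-level field name to an action), the three action names and the salt are assumptions made for illustration; the article does not describe the actual profile format or the rules FADE supports.

```python
import copy
import hashlib
from typing import Dict


def deidentify(resource: Dict, profile: Dict[str, str], salt: str) -> Dict:
    """Return a de-identified copy of `resource`; the input is left untouched."""
    result = copy.deepcopy(resource)
    for field_name, action in profile.items():
        if field_name not in result:
            continue
        if action == "redact":
            # Drop the field entirely.
            del result[field_name]
        elif action == "hash":
            # Replace the value with a salted, truncated SHA-256 digest.
            raw = (salt + str(result[field_name])).encode("utf-8")
            result[field_name] = hashlib.sha256(raw).hexdigest()[:16]
        elif action == "year_only":
            # Generalize a YYYY-MM-DD date to its year.
            result[field_name] = str(result[field_name])[:4]
        else:
            raise ValueError(f"Unknown de-identification action: {action}")
    return result


# Hypothetical profile and sample patient.
profile = {"name": "redact", "id": "hash", "birthDate": "year_only"}
patient = {
    "resourceType": "Patient",
    "id": "123",
    "name": [{"family": "Cohen"}],
    "birthDate": "1992-05-17",
    "gender": "female",
}

masked = deidentify(patient, profile, salt="per-project-secret")
```

Fields that the profile does not mention, such as `gender` here, pass through unchanged, so the profile alone decides what leaves the organization.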

The research storage holds the de-identified NDJSON files used as the source of data for the federated learning server.
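
NDJSON (newline-delimited JSON) stores one JSON document per line, which lets a consumer stream resources one at a time. The sketch below shows the format using only the Python standard library; the helper names and sample resources are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def to_ndjson(resources: Iterable[Dict]) -> str:
    """Serialize each resource as compact JSON on its own line."""
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in resources)


def from_ndjson(text: str) -> List[Dict]:
    """Parse NDJSON text back into a list of resources, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


resources = [
    {"resourceType": "Patient", "id": "a1", "gender": "female", "birthDate": "1992"},
    {"resourceType": "Observation", "id": "o1", "subject": {"reference": "Patient/a1"}},
]

ndjson_text = to_ndjson(resources)
restored = from_ndjson(ndjson_text)
```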

Federated Learning. The federated machine learning technique trains an algorithm across multiple organizations, each holding data samples retrieved from research storage. Training therefore requires the exchange of “Model Results” between the partner organizations participating in a particular research project. This exchange is performed through a secure proxy server installed in the DMZ of the organizational network.
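
The article does not say how the exchanged “Model Results” are combined. One common approach in federated learning is federated averaging, in which each organization’s model parameters are averaged with a weight proportional to its number of training samples. The sketch below illustrates only that general technique, with hypothetical names and numbers; it is not a description of the project’s actual aggregation.

```python
from typing import List, Sequence, Tuple

# One "model result": (parameter vector, number of local training samples).
ModelResult = Tuple[Sequence[float], int]


def federated_average(model_results: Sequence[ModelResult]) -> List[float]:
    """Average parameter vectors, weighting each by its local sample count."""
    if not model_results:
        raise ValueError("At least one model result is required")
    total_samples = sum(count for _, count in model_results)
    if total_samples <= 0:
        raise ValueError("Total sample count must be positive")
    size = len(model_results[0][0])
    averaged = [0.0] * size
    for weights, count in model_results:
        if len(weights) != size:
            raise ValueError("All parameter vectors must have the same length")
        for i, value in enumerate(weights):
            averaged[i] += value * count / total_samples
    return averaged


# Two organizations: only parameters and sample counts are shared, not patient data.
org_a = ([1.0, 2.0], 100)
org_b = ([3.0, 4.0], 300)
global_model = federated_average([org_a, org_b])
```

The organization with more samples pulls the shared model further toward its own parameters, while no record-level data ever leaves either site.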

4.     Proxy server: Used for the secure exchange of “Model Results” between Federated Learning nodes located at partner organizations. The data exchange is managed through secure REST API calls.
