The current landscape of secure research environments for reproducible data science
Reproducibility is central to the scientific method, and it should inform how we architect big data research environments. Scientific workflows should be transparent and available to other researchers. Public interest organizations that want to benefit from evidence derived from big data need to provide a secure environment for research when the data are sensitive. Access to cloud-based distributed computing systems offers data analytics many benefits but also introduces new challenges with respect to data privacy. The complexities of data-driven research in high performance computing environments can be mitigated with workflow tools that automate computational steps in a logical sequence. The applications that support a scientific workflow continue to evolve at a fast pace and cover a wide range of tasks, including data lineage and provenance, code and data versioning, continuous integration and deployment, containerization, workflow pipelines and distributed processing. This report compares different systems to determine which approach is most appropriate for enabling reproducible data science, collaborative research and protection of sensitive data. Advances in distributed computing and cloud-native software are converging, and system designs need to respond to the new demands of data-intensive science and the other challenges of analyzing big, sensitive datasets. A system design that builds on established architectures for big data analysis and prefers open source, cloud-agnostic solutions will be well positioned and flexible enough to meet the fast-evolving needs of researchers.
Most governments are in a position to have policy decisions better informed by evidence derived from the vast amounts of data collected when citizens engage with their services (schools, hospitals, corrections, etc.). Insofar as governments can be entrusted with citizen data and operate under a mandate to serve the public interest, data science can be used to inform policy decisions while adhering to high standards of security and privacy, following best practices and demonstrating scientific rigour to maximize public benefit. The credibility of research findings is established in part through peer review, which includes the ability of others to reproduce the analysis. It is at this intersection of responsible data stewardship, reproducible data science and high performance computing that discoveries of potential benefit to many different areas can be made.
Population Data BC (Popdata), the Western Australian Data Linkage System and the Manitoba Centre for Health Policy are examples of organizations that have been engaged in research involving population data for a number of years 1. The sensitivity of the data imposes constraints on the design and implementation of the information system. Hertzman et al describe how Popdata endorses a privacy by design framework that informs its physical, technical and administrative controls. Details of the technical system design are limited, though the secure research environment (SRE) is described as a virtual server system that includes two-factor authentication and a VPN. The description of software components and configuration details is consistent with what one would expect from a Windows-based Virtual Desktop Infrastructure (VDI). For researchers accustomed to analyzing big datasets on their own machines without any specialized infrastructure, a managed VDI is likely a welcome advancement and a logical ‘next step’ for research environments 2. Fiore et al identify key problems with desktop-based data analytics on the client side, citing I/O bottlenecks, storage resource problems and the burden of customization and system management being transferred to each individual researcher 3.
Though VDI has been around for a long time and is a familiar solution for overcoming some of the challenges noted above, it also comes with its own set of familiar problems. Harbaugh outlines some general problems that can arise, including cost, lack of configurability, admin overhead and the need for redundant servers 4. Problems specific to an SRE implemented as a VDI include the cost of storage, fixed access to VMs and static allocation of compute resources. For instance, if each machine’s image has a maximum of 24GB RAM allocated to it, the processing capacity of that machine is limited to that fixed amount. If any running analytic process exceeds the available capacity of the virtual machine, a researcher may not be able to perform the computation necessary for analysis. Meyer et al also identify ongoing hardware and software licensing costs with a Windows-based VDI, while acknowledging the general scalability and flexibility of cloud computing 5.
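The fixed-memory constraint can be made concrete with a rough back-of-the-envelope check. The sketch below is illustrative only: the function name, the overhead factor of 2 for intermediate copies and the example row counts are assumptions, not figures from the cited studies. Only the 24GB cap comes from the example above (treated here as 24 GiB).

```python
GIB = 1024 ** 3  # bytes per gibibyte

def fits_in_vm(rows: int, bytes_per_row: int, vm_ram_gib: float = 24.0,
               overhead_factor: float = 2.0) -> bool:
    """Return True if an in-memory analysis is likely to fit on a fixed-size VM.

    overhead_factor is an assumed multiplier for intermediate copies made
    during analysis (joins, sorts, model matrices); it is not a measured value.
    """
    required_bytes = rows * bytes_per_row * overhead_factor
    return required_bytes <= vm_ram_gib * GIB

# Hypothetical linked dataset at 200 bytes per row.
print(fits_in_vm(rows=50_000_000, bytes_per_row=200))   # True: about 20 GB needed
print(fits_in_vm(rows=100_000_000, bytes_per_row=200))  # False: about 40 GB needed
```

Doubling the number of rows moves the same analysis from feasible to infeasible, and on a statically allocated VM there is no way to request more memory for that one job.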
High performance computing
Moving analysis applications closer to data is an important consideration in a system designed for high-availability computing. Araya et al show how JOVIAL, their Jupyter notebook-based approach to infrastructure, uses Docker, Kubernetes, Dask and Lustre to deliver a multi-user astronomical data analysis platform 6. Though the authors identify future work to improve collaboration and reproducibility features, including a publishing mechanism, their use of container-based virtualization is one example of a modern departure from the more traditional VDI. One of the many advantages of containerization is that it comes with its own layer of security via isolation. Fiore et al include JupyterHub in their technology stack as a browser-based replacement for a desktop interface, one that can be used to spin up individual containers, notebooks or otherwise. Altintas et al also consider JupyterHub an integral part of their machine learning ecosystem 7. In using Kubernetes to orchestrate containerized applications such as JupyterHub, Araya et al demonstrate a flexible system that scales horizontally and dynamically allocates compute resources to big processing jobs on an as-needed basis.
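The idea of scaling horizontally can be sketched without any cluster at all. The example below is a minimal stand-in that uses only Python’s standard library rather than the Dask or Kubernetes APIs used by Araya et al: a dataset is split into partitions, each partition is summarized by a separate worker, and the partial results are combined. The function names and the worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

def partition(data: Sequence[float], n_parts: int) -> List[Sequence[float]]:
    """Split data into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def partial_sum(chunk: Sequence[float]) -> Tuple[float, int]:
    """Per-worker step: return (sum, count) for one partition."""
    return sum(chunk), len(chunk)

def distributed_mean(data: Sequence[float], n_workers: int = 4) -> float:
    """Combine per-partition sums into a global mean (data must be non-empty)."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, partition(data, n_workers)))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

print(distributed_mean(list(range(1, 101))))  # mean of 1..100 is 50.5
```

Threads stand in here for worker containers. Changing `n_workers` changes how the data are partitioned but not the result, which is the property that lets an orchestrator add or remove workers as demand varies.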
Big data systems share common functional components, which can be abstracted into a generic reference architecture. Sang et al identify five components related to broad functions within a big data system: “Data source; Data Collection, Processing and Loading (CPL); Data Analysis and Aggregation; Interface and Visualization; and Job and Model Specification” 8. Specific use cases from different domains add nuances from which more tailored reference architectures can be derived. Building on the domain-agnostic NIST framework 9, Klein et al create a reference architecture focused on the national security domain (Figure 1) 10. The authors identify thirteen functional components and group them into three categories: cross-cutting modules, application provider modules and framework provider modules. Components from reference architectures can be mapped to specific technologies during the implementation phase of system design.
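One way to use a reference architecture during design is as a checklist: list its components, assign candidate technologies and see what remains uncovered. The sketch below does this for the five components named by Sang et al. The function name and the assignment of tools to components are assumptions made for illustration, not mappings proposed by any of the cited authors.

```python
from typing import Dict, List, Set

# The five functional components named by Sang et al.
SANG_COMPONENTS = (
    "Data source",
    "Data Collection, Processing and Loading (CPL)",
    "Data Analysis and Aggregation",
    "Interface and Visualization",
    "Job and Model Specification",
)

def uncovered_components(mapping: Dict[str, List[str]]) -> Set[str]:
    """Return the reference components that have no technology assigned."""
    return {c for c in SANG_COMPONENTS if not mapping.get(c)}

# Illustrative assignment only; which tool fills which role is an assumption.
draft_design = {
    "Data source": ["Minio object storage"],
    "Data Analysis and Aggregation": ["Dask"],
    "Interface and Visualization": ["JupyterHub"],
}

print(sorted(uncovered_components(draft_design)))
# ['Data Collection, Processing and Loading (CPL)', 'Job and Model Specification']
```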
Based on the FAIR principles for guiding scientific data management established by Wilkinson et al 11, Madduri et al assess the reproducibility of their genome use case by measuring the findability, accessibility, interoperability and reusability of their analysis, outputs and all aspects of the data lifecycle. Because the data are not personally identifying, their information systems enable openness and sharing; in a secure research environment with sensitive data the context is different, and the same openness would be considered a breach. For instance, findability and accessibility may not be desirable qualities to emulate unless they are limited to authenticated persons or authorized groups of researchers.
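Findability scoped to authorized groups can be illustrated with a small catalogue filter. This is a minimal sketch under stated assumptions: the `CatalogueEntry` type, its `allowed_groups` field, and the dataset and group names are all hypothetical, and a real SRE would take group membership from its identity provider rather than from a function argument.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List

@dataclass(frozen=True)
class CatalogueEntry:
    """Metadata record for a dataset; allowed_groups is a hypothetical field."""
    name: str
    allowed_groups: FrozenSet[str] = field(default_factory=frozenset)

def findable_entries(catalogue: List[CatalogueEntry],
                     user_groups: FrozenSet[str]) -> List[str]:
    """List only the datasets that at least one of the user's groups may see."""
    return [e.name for e in catalogue if e.allowed_groups & user_groups]

catalogue = [
    CatalogueEntry("hospital_admissions", frozenset({"project-a"})),
    CatalogueEntry("school_enrolment", frozenset({"project-a", "project-b"})),
]

print(findable_entries(catalogue, frozenset({"project-b"})))  # ['school_enrolment']
print(findable_entries(catalogue, frozenset()))               # []
```

A user with no group membership sees an empty catalogue, so even the existence of a dataset is disclosed only to those authorized to work with it.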
According to the authors, metadata standards and workflow definitions are ways to work towards interoperability, and the use of Docker containers helps achieve a certain level of reproducibility. Fiore et al also speak to metadata and data provenance as the foundation for reproducibility. Though it is not mentioned in the articles, Pachyderm (pachyderm.io) is an open source solution for version control of data that concerns itself with metadata and data provenance. GitHub is used as a tool in at least two studies, which improves accessibility and realizes provenance of workflow insofar as workflow can be represented in code. As with other big data analysis, both studies speak to tooling that facilitates a multi-step workflow by streamlining or automating complex computational tasks.
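The link between provenance and reproducibility can be sketched as a record that identifies a workflow step by the exact data, code and container image that produced its output. The example below uses only Python’s standard library and is not the Pachyderm API; the function names, the record fields and the image tag are hypothetical.

```python
import hashlib
import json
from typing import Dict

def sha256_hex(payload: bytes) -> str:
    """Content hash used as a stable identifier for data or code."""
    return hashlib.sha256(payload).hexdigest()

def provenance_record(input_data: bytes, code: bytes, image: str) -> Dict[str, str]:
    """Describe one workflow step by the exact inputs that produced its output.

    `image` is a hypothetical container image reference, e.g. a Docker tag.
    """
    return {
        "input_sha256": sha256_hex(input_data),
        "code_sha256": sha256_hex(code),
        "container_image": image,
    }

record = provenance_record(b"id,value\n1,10\n", b"print('analysis')", "analysis:1.0")
print(json.dumps(record, indent=2))
```

Because the hashes depend only on content, two researchers who hold the same data and code obtain the same record, and any change to either produces a different one.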
Perez et al explore an event-driven, serverless computing architecture 12. Though their use case lends itself to single-purpose functions (object detection on video files), it is interesting to note, once again, the use of Docker, Kubernetes and Minio. Fiore et al describe the architecture of their Analytics-Hub as having three main open source components: JupyterHub, the Ophidia High Performance Data Analytics (HPDA) framework and Synda. Altintas et al describe a system that leverages hardware acceleration via a network of distributed multi-GPU appliances and customized Docker containers that provide low-level access to the GPUs for machine learning algorithms. The authors identify PyTorch and TensorFlow as machine learning libraries that take advantage of the custom hardware. Rook is used as a cloud-native storage orchestration solution and is also cloud-agnostic, maximizing portability between cloud providers.
Special requirements for development and operation
In situations where setting up on-prem cloud infrastructure is not an option, Moghadam and Fayoumi describe an encryption scheme to protect sensitive data in a third-party, cloud-hosted system 13. They note the trade-off between efficiency, security and functionality, and emphasize the importance of determining project objectives prior to implementation. Chaoui and Makdoun also recognize security and privacy issues with a cloud-based implementation and make four general security recommendations covering authentication, encryption, search over encrypted data and data destruction 14.
Data-driven science puts unique challenges in front of designers of secure analytics environments. At the centre of most of the solutions reviewed is Kubernetes, followed by other container-friendly, cloud-native, cloud-agnostic tools that are compatible with both Kubernetes and Linux. JupyterHub is a central component in a number of studies and is also extensible, which opens the door to spinning up more than notebook containers. The implications are that Windows-based applications become less desirable and more costly, and that the fixed resources of VDIs become a liability. The spiky processing demands of big data algorithms place loads on a VDI that it is not designed to handle.
While GPU appliances offer the most significant boost in hardware acceleration, they are an expensive proposition and require coordination with customized software. If the analytics environment is designed primarily to support machine learning, however, they may be a good, though costly, model to work from.
Using the reference architecture by Klein et al as a guide, we can start mapping specific technologies to the various components of a secure analytics environment that supports reproducible data science. Though the articles reviewed did not mention specific technologies that address a messaging component, it is reasonable to assume that the queuing of jobs in a big data system poses integration challenges and will become more of a concern as the SRE approaches capacity.
Big Data Application Provider Modules
Ligo (data linking)
Jupyter Notebooks (with R, Python and Scala Kernels)
GitLab (internal to SRE instance)
Pachyderm (Data provenance)
Audit/Reporting custom admin feature
Custom, web-based download interface, protected by VPN
Big Data Framework Provider Modules
Minio object storage
On-Prem Data Centre
Keycloak (2FA with LDAP/Kerberos)
OpenID Connect (OIDC)
Grafana
Infrastructure as code via Terraform
Cloud-agnostic software for portability
Trusted third-party identity providers (for federation)
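The list above names no technology for a messaging component, and the concern about job queuing near capacity can be illustrated with a bounded queue. This is a minimal sketch built on Python’s standard `queue` module, not a recommendation of a particular message broker; the `JobQueue` class, its capacity limit and the job names are hypothetical.

```python
import queue

class JobQueue:
    """Bounded FIFO of pending analysis jobs; capacity is a hypothetical limit."""

    def __init__(self, capacity: int) -> None:
        self._jobs: "queue.Queue[str]" = queue.Queue(maxsize=capacity)

    def submit(self, job_id: str) -> bool:
        """Queue a job, or return False when the environment is saturated."""
        try:
            self._jobs.put_nowait(job_id)
            return True
        except queue.Full:
            return False

    def next_job(self) -> str:
        """Hand the oldest queued job to a free worker (raises queue.Empty if none)."""
        return self._jobs.get_nowait()

jobs = JobQueue(capacity=2)
print([jobs.submit(j) for j in ("linkage", "regression", "training")])  # [True, True, False]
print(jobs.next_job())  # linkage
```

Rejecting the third submission makes saturation visible to the researcher instead of letting jobs pile up silently, which is the kind of behaviour a messaging component would need to define.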
- C. Pencarrick Hertzman, N. Meagher, and K. M. McGrail, “Privacy by Design at Population Data BC: a case study describing the technical, administrative, and physical controls for privacy-sensitive secondary use of personal information for research in the public interest,” J. Am. Med. Inform. Assoc., vol. 20, no. 1, pp. 25–28, Jan. 2013.
- R. Madduri et al., “Reproducible big data science: A case study in continuous FAIRness,” PLoS ONE, vol. 14, no. 4, pp. 1–22, Apr. 2019.
- S. Fiore et al., “Towards an Open (Data) Science Analytics-Hub for Reproducible Multi-Model Climate Analysis at Scale,” in 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 3226–3234.
- L. G. Harbaugh, “The Pros and Cons of Using Virtual Desktop Infrastructure,” PCWorld, vol. 30, no. 6, pp. 32–32, Jun. 2012.
- A. Meyer, L. Green, C. Faulk, S. Galla, and A.-M. Meyer, “Framework for Deploying a Virtualized Computing Environment for Collaborative and Secure Data Analytics,” eGEMs, vol. 4, no. 3, Aug. 2016.
- M. Araya et al., “JOVIAL: Notebook-based astronomical data analysis in the cloud,” Astron. Comput., vol. 25, pp. 110–117, Oct. 2018.
- I. Altintas et al., “Workflow-Driven Distributed Machine Learning in CHASE-CI: A Cognitive Hardware and Software Ecosystem Community Infrastructure,” in 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2019, pp. 865–873.
- G. M. Sang, L. Xu, and P. de Vrieze, “A reference architecture for big data systems,” in 2016 10th International Conference on Software, Knowledge, Information Management Applications (SKIMA), 2016, pp. 370–375.
- NIST Big Data Public Working Group Reference Architecture Subgroup, “NIST Big Data Interoperability Framework: Volume 6, Reference Architecture,” National Institute of Standards and Technology, NIST SP 1500-6, Oct. 2015.
- J. Klein, R. Buglak, D. Blockow, T. Wuttke, and B. Cooper, “A Reference Architecture for Big Data Systems in the National Security Domain,” in 2016 IEEE/ACM 2nd International Workshop on Big Data Software Engineering (BIGDSE), 2016, pp. 51–57.
- M. D. Wilkinson et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Sci. Data, vol. 3, no. 1, p. 160018, Dec. 2016.
- A. Pérez, S. Risco, D. M. Naranjo, M. Caballer, and G. Moltó, “On-Premises Serverless Computing for Event-Driven Data Processing Applications,” in 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), 2019, pp. 414–421.
- S. S. Moghadam and A. Fayoumi, “Toward Securing Cloud-Based Data Analytics: A Discussion on Current Solutions and Open Issues,” IEEE Access, vol. 7, pp. 45632–45650, 2019.
- H. Chaoui and I. Makdoun, “A new secure model for the use of cloud computing in big data analytics,” in Proceedings of the Second International Conference on Internet of things, Data and Cloud Computing, 2017, p. 18.