RED-ML

Reduction of high volume experimental data using machine learning

Started: September 1, 2021
Status: Completed

Abstract

PSI operates key research facilities such as the Swiss Light Source (SLS) and the Swiss X-Ray Free-Electron Laser (SwissFEL), with 4,911 user visits recorded in 2018. Advances in accelerator and detector technologies are generating massive data volumes for Serial Crystallography (SX) experiments: several petabytes of data are produced per year, and more will come with the upgrade to SLS 2.0.

This data surge challenges the entire data management and processing chain, from detector readout to data archiving. PSI collaborates with the Swiss National Supercomputing Centre (CSCS), which provides high-performance computing for a range of scientific purposes, on scalable solutions for managing multi-petabyte data, with a focus on reducing data movement overheads. PSI aims to work with the Swiss Data Science Center (SDSC) and CSCS on data reduction using machine learning for SX, which will help control infrastructure, networking, storage, and computing costs. Machine learning can enhance data reduction beyond traditional methods, aiding in image classification, region identification, and automated segmentation.

People

Collaborators

SDSC Team:
Luis Barba Flores
Benjamin Béjar Haro

PI | Partners:

PSI, Center for Computing, Theory and Data:

  • Dr. Ashton, Alun
  • Dr. Janousch, Markus
  • Dr. Leonarski, Filip
  • Dr. Gasparotto, Piero
  • Dr. Wojdyla, Justyna
  • Dr. Assmann, Greta
  • Dr. Alam, Sadaf

Motivation

Serial Crystallography (SX) in structural biology involves collecting diffraction data from numerous micro- or nano-crystals to assemble a complete dataset. Unlike traditional crystallography, which uses a single large crystal, SX collects diffraction patterns from many small, randomly oriented crystals and merges them to determine the 3D structure of macromolecules.

SX is valuable for studying proteins that don’t form large, high-quality crystals and for capturing ultrafast dynamic processes in time-resolved experiments. It involves illuminating numerous protein crystals with high-intensity X-ray beams at free-electron lasers (FELs) and synchrotrons, generating massive datasets of thousands to millions of diffraction patterns.

To obtain the final structure, each diffraction pattern must be indexed (i.e., the crystal orientation must be determined) and merged into a complete dataset. Recent advancements in X-ray detectors, like the JUNGFRAU 4-megapixel (4M) detector, which operates at a maximum frame rate of 2 kHz, have revolutionized macromolecular crystallography (MX). However, these advancements present challenges for real-time indexing algorithms and user feedback due to the enormous data volumes. Real-time feedback can help identify the frames that carry valuable information, alleviating data storage issues.
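To give a sense of scale, the raw data rate implied by these detector figures can be estimated with a short back-of-the-envelope calculation. The 2 kHz frame rate and the 4-megapixel size come from the text above; the exact pixel count (4 x 1024 x 1024) and the 2 bytes per pixel are assumptions made for illustration, and the function name is hypothetical.

```python
def detector_data_rate(n_pixels, bytes_per_pixel, frame_rate_hz):
    """Return the uncompressed data rate in bytes per second."""
    return n_pixels * bytes_per_pixel * frame_rate_hz


N_PIXELS = 4 * 1024 * 1024   # assumption: "4 megapixel" taken as 4 x 1024 x 1024
BYTES_PER_PIXEL = 2          # assumption: 16-bit raw pixel values
FRAME_RATE_HZ = 2000         # maximum frame rate quoted in the text (2 kHz)

rate = detector_data_rate(N_PIXELS, BYTES_PER_PIXEL, FRAME_RATE_HZ)
print(f"{rate / 1e9:.1f} GB/s")                 # 16.8 GB/s
print(f"{rate * 3600 / 1e12:.1f} TB per hour")  # 60.4 TB per hour
```

Under these assumptions, an hour of continuous acquisition at the maximum frame rate would produce tens of terabytes before any compression or frame rejection.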

Figure 1: Schematic of serial crystallography: An X-ray beam illuminates a jet of uniformly distributed protein crystals in a viscous medium. The detector captures diffraction images, with advanced technology enabling data rates over 1 kHz, producing terabytes of time-series data stored in HDF5 files. Initial processing identifies distinct signals and strong reflections. Indexing then associates spots with Miller indices for merging and integration.
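As a rough illustration of the "initial processing" step that identifies strong reflections, the sketch below flags frames that contain enough bright pixels to be worth keeping. It is a deliberately crude stand-in: the global mean-plus-n-sigma threshold, the function names, and the default parameters are all assumptions for illustration and are not taken from the project's pipeline, where spot finding would typically use local background estimates.

```python
import statistics


def count_strong_pixels(frame, n_sigma=5.0):
    """Count pixels lying more than n_sigma standard deviations above the frame mean."""
    pixels = [value for row in frame for value in row]
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    threshold = mean + n_sigma * std
    return sum(1 for value in pixels if value > threshold)


def is_hit(frame, min_spots=3, n_sigma=5.0):
    """Classify a frame as a hit if it has at least min_spots strong pixels."""
    return count_strong_pixels(frame, n_sigma) >= min_spots


def select_hits(frames, min_spots=3, n_sigma=5.0):
    """Return the indices of the frames worth keeping for indexing."""
    return [i for i, frame in enumerate(frames) if is_hit(frame, min_spots, n_sigma)]
```

Frames rejected at this stage never need to be indexed or stored, which is where the data reduction comes from.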

Proposed Approach / Solution

We introduce the TORO (TOrch-powered Robust Optimization) indexer, a robust and adaptable indexing algorithm developed using the PyTorch framework. The proposed method relies on efficient sampling of the space of rotations, combined with robust regression with outlier rejection. For the robust regression part, we propose an annealing strategy that solves a sequence of robust estimation problems, akin to Trimmed Least Squares, until the error falls below a prescribed threshold. The problem is combinatorial in nature, and the proposed method iteratively updates, in a greedy fashion, the support set of inlier measurements that are finally used to determine the crystal orientation. Besides state-of-the-art estimation performance, the TORO indexer has attractive computational and algorithmic properties: it can run on GPUs, CPUs, and other hardware accelerators supported by PyTorch, ensuring compatibility with a wide variety of computational setups. In our tests, TORO outpaces existing solutions, indexing thousands of frames per second when running on GPUs, which makes it an attractive candidate for real-time indexing and user feedback. Our algorithm streamlines some of the ideas introduced by previous indexers such as the DIALS real-space grid search and XGandalf, and refines them using faster and principled robust optimization techniques, resulting in a concise codebase of fewer than 500 lines.
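The combination of rotation sampling and annealed trimmed least squares can be illustrated on a toy problem. The sketch below is not the TORO implementation: it estimates a single 2D rotation angle from point pairs contaminated with outliers, uses plain Python instead of PyTorch, and every function name and parameter (keep fraction, shrink factor, tolerance) is a hypothetical choice made for illustration. It only shows the general pattern of a coarse search over rotations followed by a greedy, shrinking inlier support set.

```python
import math
import random


def rotate(p, theta):
    """Rotate the 2D point p by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def residual(p, q, theta):
    """Distance between the rotated point p and its observed counterpart q."""
    rx, ry = rotate(p, theta)
    return math.hypot(rx - q[0], ry - q[1])


def fit_rotation(pairs):
    """Closed-form least-squares rotation angle for a set of 2D point pairs."""
    cross = sum(p[0] * q[1] - p[1] * q[0] for p, q in pairs)
    dot = sum(p[0] * q[0] + p[1] * q[1] for p, q in pairs)
    return math.atan2(cross, dot)


def coarse_search(pairs, n_samples=72):
    """Sample candidate angles and keep the one with the smallest median residual."""
    def median_residual(theta):
        res = sorted(residual(p, q, theta) for p, q in pairs)
        return res[len(res) // 2]
    candidates = [2.0 * math.pi * i / n_samples for i in range(n_samples)]
    return min(candidates, key=median_residual)


def trimmed_fit(pairs, theta0, tol=0.05, keep_min=0.5, shrink=0.9):
    """Annealed trimmed least squares: refit on a shrinking set of best-fitting pairs."""
    theta, keep = theta0, 1.0
    while True:
        # Greedy support update: keep the pairs with the smallest residuals.
        ranked = sorted(pairs, key=lambda pq: residual(pq[0], pq[1], theta))
        k = max(2, round(keep * len(pairs)))
        support = ranked[:k]
        theta = fit_rotation(support)
        worst = max(residual(p, q, theta) for p, q in support)
        # Stop once every retained pair fits, or the keep fraction hits its floor.
        if worst < tol or keep <= keep_min:
            return theta, support
        keep = max(keep_min, keep * shrink)


def make_toy_data(true_theta, n_inliers=40, n_outliers=10, noise=0.005, seed=0):
    """Generate rotated point pairs with small noise plus unrelated outlier pairs."""
    rng = random.Random(seed)
    pairs = []
    for i in range(n_inliers + n_outliers):
        p = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        if i < n_inliers:
            qx, qy = rotate(p, true_theta)
            q = (qx + rng.gauss(0, noise), qy + rng.gauss(0, noise))
        else:
            q = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        pairs.append((p, q))
    return pairs


pairs = make_toy_data(true_theta=0.7)
theta, support = trimmed_fit(pairs, coarse_search(pairs))
print(f"estimated angle: {theta:.3f} rad, pairs kept: {len(support)}")
```

In the real indexing problem the unknown is a 3D crystal orientation and the measurements are diffraction spots, but the structure of the loop is analogous: rank measurements by residual, keep the best-fitting subset, refit, and tighten the subset until the retained measurements agree within the prescribed threshold.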

Figure 2: A schematic representation of the full TORO pipeline. Users can use TORO as a standalone Python module or use TorchScript to run it in a non-Python environment such as CrystFEL.

Impact

Based on our evaluations across four proteins, TORO consistently matches and, in certain instances, outperforms established algorithms such as XGandalf and MOSFLM, occasionally improving the quality of the merged data while achieving indexing rates that are orders of magnitude higher. The inherent modularity of TORO and the versatility of PyTorch codebases facilitate its deployment on a wide array of architectures, software platforms, and bespoke applications, highlighting its potential significance for serial crystallography. The method has been successfully tested at both the Swiss Light Source (synchrotron) and SwissFEL at PSI, as well as at other European facilities including MAX IV, DESY, and ALBA.
