RED-ML

Reduction of high volume experimental data using machine learning

Started

September 1, 2021

Status

Completed

PSI operates key research facilities such as the Swiss Light Source (SLS) and the Swiss X-Ray Free-Electron Laser (SwissFEL), and recorded 4,911 user visits in 2018. Advances in accelerator and detector technologies are generating massive data volumes for Serial Crystallography (SX) experiments: several petabytes per year, with more to come after the upgrade to SLS 2.0.

This data surge challenges the entire data management and processing chain, from detector readout to data archiving. PSI collaborates with the Swiss National Supercomputing Centre (CSCS) on scalable solutions for managing multi-petabyte data, focusing on reducing data movement overheads. CSCS provides high-performance computing for various scientific purposes. PSI aims to work with the Swiss Data Science Center (SDSC) and CSCS on machine-learning-based data reduction for SX, which will help control infrastructure, networking, storage, and computing costs. Machine learning can push data reduction beyond what traditional methods achieve, aiding in image classification, region identification, and automated segmentation.

People

Collaborators

SDSC Team:

Luis Barba Flores

Senior Data Scientist

Luis Barba Flores joined the SDSC in 2022 as Senior Data Scientist. He received a joint PhD in Computer Science in 2016 from the Université Libre de Bruxelles and Carleton University. He served as a postdoctoral researcher at ETH Zurich from 2016 to 2019, and then moved to EPFL Lausanne to work in the Machine Learning and Optimization Group until 2022. His research interests include distributed optimization algorithms, first-order optimization methods, and their applications to Deep Learning models.

Benjamín Béjar Haro

Lead Data Scientist & group leader SDSC Hub at PSI

Benjamín Béjar received a PhD in Electrical Engineering from Universidad Politécnica de Madrid in 2012. He served as a postdoctoral fellow at École Polytechnique Fédérale de Lausanne until 2017, and then moved to Johns Hopkins University, where he held a Research Faculty position until Dec. 2019. His research interests lie at the intersection of signal processing and machine learning methods, and he has worked on topics such as sparse signal recovery, time-series analysis, and computer vision methods with special emphasis on biomedical applications. Since 2021, Benjamín has led the SDSC office at the Paul Scherrer Institute in Villigen.

PI | Partners:

PSI, Center for Computing, Theory and Data:

Dr. Ashton, Alun
Dr. Janousch, Markus
Dr. Leonarski, Filip
Dr. Gasparotto, Piero
Dr. Wojdyla, Justyna
Dr. Assmann, Greta
Dr. Alam, Sadaf

More info

Motivation

Serial Crystallography (SX) in structural biology involves collecting diffraction data from numerous micro- or nano-crystals to assemble a complete dataset. Unlike traditional crystallography, which uses a single large crystal, SX collects diffraction patterns from many small, randomly oriented crystals and merges them to determine the 3D structure of macromolecules.

SX is valuable for studying proteins that don’t form large, high-quality crystals and for capturing ultrafast dynamic processes in time-resolved experiments. It involves illuminating numerous protein crystals with high-intensity X-ray beams at Free Electron Lasers (FEL) and synchrotrons, generating massive datasets of thousands to millions of diffraction patterns.

To obtain the final structure, each diffraction pattern must be indexed (i.e., the crystal orientation must be determined) and merged into a complete dataset. Recent advances in X-ray detectors, such as the JUNGFRAU 4-megapixel (4M) detector, which operates at a maximum frame rate of 2 kHz, have revolutionized macromolecular crystallography (MX). However, the enormous data volumes they produce challenge real-time indexing algorithms and user feedback. Real-time feedback can help identify the frames that contain valuable information, alleviating data storage issues.
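To give a feel for the data volumes involved, the following back-of-the-envelope sketch estimates the raw output of a 4-megapixel detector running at 2 kHz. The pixel count (4 × 1024 × 1024), the 16-bit pixel depth, the absence of compression, and the 10% hit fraction are assumptions made for illustration and are not figures taken from the project description.

```python
# Back-of-the-envelope raw data rate for a 4-megapixel detector at 2 kHz.
# Assumptions (not from the source): 4M = 4 * 1024 * 1024 pixels,
# 16-bit pixels, no compression, and a hypothetical 10% hit fraction.

PIXELS_PER_FRAME = 4 * 1024 * 1024
BYTES_PER_PIXEL = 2
FRAME_RATE_HZ = 2_000


def raw_rate_bytes_per_s(pixels: int, bytes_per_pixel: int, rate_hz: int) -> int:
    """Uncompressed detector output in bytes per second."""
    return pixels * bytes_per_pixel * rate_hz


def stored_rate_bytes_per_s(raw_rate: float, hit_fraction: float) -> float:
    """Rate written to disk if only frames classified as hits are kept."""
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must lie in [0, 1]")
    return raw_rate * hit_fraction


raw = raw_rate_bytes_per_s(PIXELS_PER_FRAME, BYTES_PER_PIXEL, FRAME_RATE_HZ)
kept = stored_rate_bytes_per_s(raw, hit_fraction=0.10)

print(f"raw:  {raw / 1e9:.1f} GB/s, {raw * 3600 / 1e12:.1f} TB per hour")
print(f"kept: {kept / 1e9:.1f} GB/s at a 10% hit fraction")
```

Under these assumptions the uncompressed stream is on the order of tens of gigabytes per second, which is why discarding frames without useful diffraction early in the pipeline has such a large effect on storage.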

***Figure 1:*** Schematic of serial crystallography: An X-ray beam illuminates a jet of uniformly distributed protein crystals in a viscous medium. The detector captures diffraction images, with advanced technology enabling data rates over 1 kHz, producing terabytes of time-series data stored in HDF5 files. Initial processing identifies distinct signals and strong reflections. Indexing then associates spots with Miller indices for merging and integration.

Proposed Approach / Solution

We introduce the TORO (TOrch-powered Robust Optimization) indexer, a robust and adaptable indexing algorithm developed using the PyTorch framework. The proposed method relies on efficient sampling of the space of rotations, combined with robust regression with outlier rejection. For the robust regression part, we propose an annealing strategy that solves a sequence of robust estimation problems, akin to Trimmed Least Squares, until the error falls below a prescribed threshold. The problem is combinatorial in nature, and the proposed method iteratively updates, in a greedy fashion, the support set of inlier measurements that are ultimately used to determine the crystal orientation. Besides state-of-the-art estimation performance, the TORO indexer has attractive computational and algorithmic properties: it runs on GPUs, CPUs, and other hardware accelerators supported by PyTorch, ensuring compatibility with a wide variety of computational setups. In our tests, TORO outpaces existing solutions, indexing thousands of frames per second when running on GPUs, which makes it an attractive candidate for real-time indexing and user feedback. Our algorithm streamlines some of the ideas introduced by previous indexers such as DIALS real grid search and XGandalf, and refines them using faster and principled robust optimization techniques, resulting in a concise codebase of fewer than 500 lines.

***Figure 2:*** *A schematic representation of the full TORO pipeline. Users can use TORO as a standalone Python module or use TorchScript to run it in a non-Python environment such as CrystFEL.*
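The robust regression step described above can be illustrated with a small sketch. This is not TORO's code: it uses NumPy rather than PyTorch to stay dependency-light, it assumes the rotation-sampling stage has already produced a rough candidate orientation, and the names `trimmed_refine` and `rot_z`, the threshold schedule, and the synthetic data are all hypothetical. It alternates between selecting the spots whose fractional Miller indices lie close to integers (the inlier support) and refitting the reciprocal basis by least squares on that support, while tightening the tolerance.

```python
import numpy as np


def rot_z(deg: float) -> np.ndarray:
    """Rotation matrix about the z axis by `deg` degrees."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def trimmed_refine(spots, basis0, thresholds=(0.2, 0.1, 0.05), n_iter=5):
    """Refine a reciprocal basis (columns = basis vectors) from noisy spots.

    spots: (N, 3) reciprocal-space vectors, some of them outliers.
    thresholds: decreasing inlier tolerances on the fractional-index error
                (a hypothetical annealing schedule).
    Returns the refined basis and the inlier mask from the last iteration.
    """
    basis = np.array(basis0, dtype=float)
    inliers = np.zeros(len(spots), dtype=bool)
    for tau in thresholds:                         # annealing: tighten tolerance
        for _ in range(n_iter):
            frac = spots @ np.linalg.inv(basis).T  # fractional Miller indices
            hkl = np.round(frac)                   # nearest integer indices
            err = np.linalg.norm(frac - hkl, axis=1)
            inliers = err < tau                    # greedy support update
            if inliers.sum() < 3:
                break                              # too few spots to refit
            # Least squares on the inliers only: spots ≈ hkl @ basis.T
            sol, _, rank, _ = np.linalg.lstsq(
                hkl[inliers], spots[inliers], rcond=None
            )
            if rank < 3:
                break                              # degenerate support, keep basis
            basis = sol.T
    return basis, inliers


# Synthetic demo (all values hypothetical): cubic cell, 60 random outliers.
rng = np.random.default_rng(0)
true_basis = rot_z(30.0) @ (0.1 * np.eye(3))
n_true, n_out = 200, 60
hkl_true = rng.integers(-5, 6, size=(n_true, 3))
good = hkl_true @ true_basis.T + rng.normal(scale=1e-4, size=(n_true, 3))
bad = rng.uniform(-0.5, 0.5, size=(n_out, 3))
spots = np.vstack([good, bad])

guess = rot_z(2.0) @ true_basis                    # rough candidate orientation
refined, inliers = trimmed_refine(spots, guess)
print("max basis error:", np.abs(refined - true_basis).max())
print("true spots kept as inliers:", int(inliers[:n_true].sum()), "of", n_true)
```

Starting with a loose tolerance lets a roughly correct orientation capture enough spots to improve itself; tightening it afterwards excludes spurious spots that happened to fall near a lattice point. A real indexer would also have to score many candidate orientations and handle unknown or non-cubic cells, which this sketch leaves out.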

Impact

Based on our evaluations across four proteins, TORO consistently matches and, in certain instances, outperforms established algorithms such as XGandalf and MOSFLM, occasionally improving the quality of the consolidated data while achieving indexing rates that are orders of magnitude higher. The inherent modularity of TORO and the versatility of PyTorch codebases facilitate its deployment across a wide array of architectures, software platforms, and bespoke applications, highlighting its potential significance for serial crystallography. The method has been successfully tested at both the Swiss Light Source (synchrotron) and SwissFEL at PSI, as well as at other European facilities including MAX IV, DESY, and ALBA.

Annexe

Code

TORO Indexer for serial crystallography: Reproducible Data Science | Open Research | Renku

Bibliography

Ke, T.-W., Brewster, A. S., Yu, S. X., Ushizima, D., Yang, C., & Sauter, N. K. (2018). A convolutional neural network-based screening tool for X-ray serial crystallography. Journal of Synchrotron Radiation, 25(4), 655-670. https://doi.org/10.1107/S1600577518005628
Leonarski, F., Redford, S., Mozzanica, A., Lopez-Cuenca, C., Panepucci, E., Nass, K., Ozerov, D., Vera, L., Olieric, V., Buntschu, D., Schneider, R., Tinti, G., Froejdh, E., Diederichs, K., Bunk, O., Schmitt, B., & Wang, M. (2018). Fast and accurate data collection for macromolecular crystallography using the JUNGFRAU detector. Nature Methods, 15(10), 799-804. https://doi.org/10.1038/s41592-018-0143-7
Wojdyla, J. A., Kaminski, J. W., Panepucci, E., Ebner, S., Wang, X., Gabadinho, J., & Wang, M. (2018). DA+ data acquisition and analysis software at the Swiss Light Source macromolecular crystallography beamlines. Journal of Synchrotron Radiation, 25(2), 293-303. https://doi.org/10.1107/S1600577517018164

Publications



Gasparotto, P.; Barba, L.; Stadler, H.; Assmann, G.; Mendonça, H.; Ashton, A. W.; Janousch, M.; Leonarski, F.; Béjar, B. "TORO Indexer: a PyTorch-based indexing algorithm for kilohertz serial crystallography." Journal of Applied Crystallography, 57(4), 931-944 (2024).