RED-ML
Reduction of high volume experimental data using machine learning
Abstract
The Paul Scherrer Institute (PSI) operates key research facilities, including the Swiss Light Source (SLS) and the Swiss X-Ray Free-Electron Laser (SwissFEL), and recorded 4,911 user visits in 2018. Advances in accelerator and detector technologies are generating massive data volumes for Serial Crystallography (SX) experiments: several petabytes per year already, with more expected after the upgrade to SLS 2.0.
This data surge challenges the entire data management and processing chain, from detector readout to data archiving. PSI collaborates with the Swiss National Supercomputing Centre (CSCS), which provides high-performance computing for a range of scientific purposes, on scalable solutions for managing multi-petabyte data, with a focus on reducing data movement overheads. PSI aims to work with the Swiss Data Science Center (SDSC) and CSCS on machine-learning-based data reduction for SX, which will help control infrastructure, networking, storage, and computing costs. Machine learning can take data reduction beyond traditional methods by supporting image classification, region identification, and automated segmentation.
People
Collaborators
Luis Barba Flores joined the SDSC in 2022 as Senior Data Scientist. He received a joint PhD in Computer Science in 2016 from the Université Libre de Bruxelles and Carleton University. He served as a postdoctoral researcher at ETH Zurich from 2016 to 2019, and then moved to EPFL Lausanne to work in the Machine Learning and Optimization Group until 2022. His research interests include distributed optimization algorithms, first-order optimization methods, and their applications to deep learning models.
Benjamín Béjar received a PhD in Electrical Engineering from Universidad Politécnica de Madrid in 2012. He served as a postdoctoral fellow at École Polytechnique Fédérale de Lausanne until 2017, and then moved to Johns Hopkins University, where he held a Research Faculty position until December 2019. His research interests lie at the intersection of signal processing and machine learning, and he has worked on topics such as sparse signal recovery, time-series analysis, and computer vision, with special emphasis on biomedical applications. Since 2021, Benjamín has led the SDSC office at the Paul Scherrer Institute in Villigen.
PI | Partners:
PSI, Center for Computing, Theory and Data:
- Dr. Ashton, Alun
- Dr. Janousch, Markus
- Dr. Leonarski, Filip
- Dr. Gasparotto, Piero
- Dr. Wojdyla, Justyna
- Dr. Assmann, Greta
- Dr. Alam, Sadaf
Motivation
Serial Crystallography (SX) in structural biology involves collecting diffraction data from numerous micro- or nano-crystals to assemble a complete dataset. Unlike traditional crystallography, which uses a single large crystal, SX collects diffraction patterns from many small, randomly oriented crystals and merges them to determine the 3D structure of macromolecules.
SX is valuable for studying proteins that don’t form large, high-quality crystals and for capturing ultrafast dynamic processes in time-resolved experiments. It involves illuminating numerous protein crystals with high-intensity X-ray beams at free-electron lasers (FELs) and synchrotrons, generating massive datasets of thousands to millions of diffraction patterns.
To obtain the final structure, each diffraction pattern must be indexed (i.e., the crystal orientation must be determined) and merged into a complete dataset. Recent advances in X-ray detectors, such as the JUNGFRAU 4-megapixel (4M) detector, which operates at a maximum frame rate of 2 kHz, have revolutionized macromolecular crystallography (MX). However, the enormous data volumes they produce are a challenge for real-time indexing algorithms and user feedback. Real-time feedback can help identify the frames that contain valuable information, so that only those need to be kept, which alleviates data storage issues.
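To give a sense of scale, the raw output of such a detector can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation only: the frame rate comes from the text above, while the nominal pixel count of 4,000,000 and the 2 bytes per pixel are assumptions made for illustration, and the function name is hypothetical.

```python
def raw_data_rate_gb_per_s(n_pixels: int, bytes_per_pixel: int, frame_rate_hz: float) -> float:
    """Uncompressed detector output in gigabytes (1e9 bytes) per second."""
    return n_pixels * bytes_per_pixel * frame_rate_hz / 1e9


# Assumptions for illustration (not taken from the text):
N_PIXELS = 4_000_000   # nominal "4-megapixel" pixel count
BYTES_PER_PIXEL = 2    # assumed 16-bit raw pixel values

# From the text: maximum frame rate of 2 kHz.
FRAME_RATE_HZ = 2_000

rate = raw_data_rate_gb_per_s(N_PIXELS, BYTES_PER_PIXEL, FRAME_RATE_HZ)
tb_per_hour = rate * 3600 / 1000
print(f"{rate:.1f} GB/s raw, {tb_per_hour:.1f} TB per hour of continuous acquisition")
```

Under these assumptions the uncompressed stream is on the order of 16 GB/s, which is why discarding uninformative frames early matters for storage.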
Proposed Approach / Solution
We introduce the TORO (TOrch-powered Robust Optimization) indexer, a robust and adaptable indexing algorithm developed using the PyTorch framework. The proposed method relies on efficient sampling of the space of rotations, combined with robust regression with outlier rejection. For the robust regression part, we propose an annealing strategy that solves a sequence of robust estimation problems, akin to Trimmed Least Squares, until the error is below a prescribed threshold. The problem is combinatorial in nature, and the proposed method iteratively updates (in a greedy fashion) the support set of inlier measurements that are finally used to determine the crystal orientation. Besides state-of-the-art estimation performance, the TORO indexer has attractive computational and algorithmic properties: it runs on GPUs, CPUs, and other hardware accelerators supported by PyTorch, ensuring compatibility with a wide variety of computational setups. In our tests, TORO outpaces existing solutions, indexing thousands of frames per second when running on GPUs, which positions it as an attractive candidate for real-time indexing and user feedback. Our algorithm streamlines some of the ideas introduced by previous indexers, such as the DIALS real-space grid search and XGandalf, and refines them using faster and principled robust optimization techniques, resulting in a concise codebase of fewer than 500 lines.
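The robust regression idea can be illustrated on a toy problem. The sketch below is not the TORO implementation: it replaces the crystal-orientation estimate with a simple line fit, and the function names, the shrink schedule, and the parameters `tol`, `shrink`, and `min_keep` are all hypothetical. It only shows the general pattern described above: fit, rank measurements by residual, greedily keep a shrinking set of inliers, and stop once the worst inlier error falls below a threshold.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx


def annealed_trimmed_fit(points, tol=0.05, shrink=0.9, min_keep=0.3):
    """Toy trimmed least squares with a shrinking inlier set.

    Starts with every point as an inlier, then alternates between fitting on
    the current inliers and re-selecting the points with the smallest
    residuals. The kept fraction shrinks by `shrink` each round until the
    worst inlier residual is at most `tol` or the fraction would drop
    below `min_keep`.
    """
    keep = 1.0
    inliers = list(points)
    while True:
        a, b = fit_line(inliers)
        worst = max(abs(y - (a * x + b)) for x, y in inliers)
        if worst <= tol or keep * shrink < min_keep:
            return (a, b), inliers
        keep *= shrink
        k = max(2, int(keep * len(points)))
        # Greedy support update: rank all points against the current fit.
        ranked = sorted(points, key=lambda p: abs(p[1] - (a * p[0] + b)))
        inliers = ranked[:k]


# Synthetic data: the line y = 2x + 1, with every fourth point shifted by +5.
data = [(i / 99, 2 * (i / 99) + 1 + (5.0 if i % 4 == 0 else 0.0)) for i in range(100)]

(slope, intercept), support = annealed_trimmed_fit(data)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, inliers kept={len(support)}")
```

In an indexer, the line parameters would be replaced by the lattice orientation and the points by measured diffraction spots; a tensor library such as PyTorch would allow many candidate fits to be evaluated in parallel.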
Impact
Based on our evaluations across four proteins, TORO consistently matches and, in certain instances, outperforms established algorithms such as XGandalf and MOSFLM, occasionally improving the quality of the merged data while achieving indexing rates that are orders of magnitude higher. The inherent modularity of TORO and the versatility of PyTorch codebases facilitate its deployment on a wide array of architectures, software platforms, and bespoke applications, highlighting its potential significance for serial crystallography. The method has been successfully tested at both the Swiss Light Source (synchrotron) and SwissFEL at PSI, as well as at other facilities in Europe, including MAX IV, DESY, and ALBA.
Annexe
Publications
- Gasparotto P, Barba L, Stadler H-C, Assmann G, Mendonça H, Ashton A, et al. TORO Indexer: A PyTorch-Based Indexing Algorithm for kHz Serial Crystallography. ChemRxiv. 2023, https://doi.org/10.26434/chemrxiv-2023-wnm9n
- TORO Indexer for serial crystallography: Reproducible Data Science | Open Research | Renku
Additional resources
Bibliography
- Ke, T.-W., Brewster, A. S., Yu, S. X., Ushizima, D., Yang, C., & Sauter, N. K. (2018). A convolutional neural network-based screening tool for X-ray serial crystallography. Journal of Synchrotron Radiation, 25(4), 655-670. https://doi.org/10.1107/S1600577518005628
- Leonarski, F., Redford, S., Mozzanica, A., Lopez-Cuenca, C., Panepucci, E., Nass, K., Ozerov, D., Vera, L., Olieric, V., Buntschu, D., Schneider, R., Tinti, G., Froejdh, E., Diederichs, K., Bunk, O., Schmitt, B., & Wang, M. (2018). Fast and accurate data collection for macromolecular crystallography using the JUNGFRAU detector. Nature Methods, 15(10), 799-804. https://doi.org/10.1038/s41592-018-0143-7
- Wojdyla, J. A., Kaminski, J. W., Panepucci, E., Ebner, S., Wang, X., Gabadinho, J., & Wang, M. (2018). DA+ data acquisition and analysis software at the Swiss Light Source macromolecular crystallography beamlines. Journal of Synchrotron Radiation, 25(2), 293-303. https://doi.org/10.1107/S1600577517018164