# Open and reproducible environmental science: from theory to equations and algorithms

## Computer models for generation and use of scientific understanding

Mathematical and numerical models are increasingly important for our understanding and prediction of complex interactive processes. Consider, for example, our climate system. Current understanding of the physical processes underlying air movement, heat exchange, and evaporation-condensation of water is not sufficient to predict possible effects of elevated atmospheric CO2 concentrations on wind, temperature, humidity and precipitation patterns around the globe. We need complex models that accurately represent the feedbacks between different processes and compartments to inform us how a perturbation in one component may affect other components of the coupled climate-earth surface system that are relevant to us.

## Process understanding encoded in mathematical equations and algorithms

How can our process understanding be transferred into such models? More fundamentally, what is our quantitative process understanding and where does it come from? Our understanding is ultimately gained from a growing body of observations, including experiments and environmental monitoring, and logical reasoning. The induction of general laws from data (inductive reasoning) usually leads to the formulation of **mathematical equations**, e.g. Newton’s laws of motion. The body of established, general laws can then be used to deduce additional equations that can help predict a process of interest (deductive reasoning). These equations are then translated into algorithms that represent various processes in a model and contribute to enabling predictions about quantities of interest, e.g. stream-flow trends for the next 50 years. See Figure 1 for a graphical illustration of the process. In this blog, we will focus on the steps of deducing equations from an existing body of knowledge and translating these into algorithms (see magnifying glasses in Figure 1).

## Do not let understanding get lost in translation!

The seemingly straightforward steps of building a system of equations on prior knowledge and transferring these equations into models are often neither transparent nor easy to reproduce. Many mathematical derivations in papers contain the famous “*it follows that*” statement somewhere, introducing an equation whose origin remains entirely mysterious to the readers, even after repeated re-reading of the preceding sections. Obviously, since this has passed peer review, it MUST be correct, so, trusting the collective mental capacity of the authors, editors and reviewers, the readers proceed to assess the utility of the equations by studying the model output, which presumably looks very reasonable. Next, the readers would like to test the utility of the equations in their own model on their own data. To do this, they need to understand the context in which the equations were actually used in the original model. Here comes the next problem. Even if the model code is available, the equations in the code are usually not recognisable to the general reader. If the readers are lucky, the code documentation connects specific lines of code to the original equations in the paper.

## The tragedy of missing details

So far so good. For the equations or the code to be re-usable, the readers must be able to substitute their own parameter values and compute results for their own problems. This is the most tragic part of the workflow, as the meaning of the model parameters, and especially their units of measurement, are often not readily accessible, due to omission or implicit use of discipline-specific conventions, which may change over time. The readers use their own intuition about the meaning of parameters and the units in which they need to be entered, and if the results look plausible, they trust the model and their assumptions. However, wrong assumptions about units are often not immediately obvious and have led to epic failures in the past (e.g. http://edition.cnn.com/TECH/space/9909/30/mars.metric.02/). Furthermore, explicit consideration of units of measurement in published equations sometimes reveals mismatches that may indicate a fundamental problem in the equations and/or variable definitions (see example described below).
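
To make the point about silent unit errors concrete, here is a minimal plain-Python sketch. It assumes a hypothetical impulse value handed from one program to another without a unit label; the helper `impulse_si` and the value `100.0` are illustrative only. The conversion factor is the standard definition of the pound-force in newtons, the unit pair involved in the Mars Climate Orbiter loss.

```python
# 1 lbf = 4.4482216152605 N (exact by definition of the pound-force)
LBF_TO_N = 4.4482216152605


def impulse_si(value, unit):
    """Convert an impulse to newton-seconds, given an explicit unit label."""
    factors = {"N*s": 1.0, "lbf*s": LBF_TO_N}
    if unit not in factors:
        raise ValueError(f"unknown impulse unit: {unit!r}")
    return value * factors[unit]


# Hypothetical number exchanged between two programs without any unit metadata.
reported = 100.0

as_assumed = impulse_si(reported, "N*s")     # what the reader assumes
as_intended = impulse_si(reported, "lbf*s")  # what the author meant
ratio = as_intended / as_assumed

print(f"assumed:  {as_assumed:.1f} N*s")
print(f"intended: {as_intended:.1f} N*s")
print(f"ratio:    {ratio:.2f}")
```

Both numbers look plausible in isolation. Only the unit label, carried along with the value, reveals that they differ by a factor of more than four.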

## Enabling transparent and traceable conversion of knowledge

At the Swiss Data Science Center, we have developed the Environmental Science using Symbolic Math (ESSM) package, which allows transparent propagation of metadata about variables and equations from papers to the final code. When using this package, the user can easily access important information such as variable definitions, descriptions and units of measurement at any time. The package is built for the free software SageMath and makes use of the intuitive programming language Python. It is available on PyPI, and the source code can also be accessed on Zenodo and GitHub.

Among various methods for dealing with variable and equation metadata, the ESSM package also provides a built-in algorithm to check for consistency of units when formulating equations. This task, rightfully considered self-evident in any derivations of physical equations, is often omitted in the literature, as seen for example in the famous paper by Priestley and Taylor (1972), which we use below to illustrate the utility of the framework along the line of thought presented above.

### Example: Derivation of the Priestley-Taylor equation (Priestley and Taylor, 1972)

A key step in the derivation of the Priestley-Taylor equation is Equation 3, shown in a screen shot of the paper (Fig. 2).

Since the units of the variables were not specified in the paper, we make informed guesses based on the description in the text and widespread literature conventions:


Our variable definitions in a table:

Using the variables defined above, we can write Equation 3 in Priestley and Taylor (1972) as a symbolic expression and verify visually that it is consistent with the formulation shown in the screenshot above:

Now, we will try to use the above expression to define a physical equation representing Equation 3 in the paper:
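
To illustrate the kind of check involved without relying on the package itself, here is a plain-Python sketch. It is not the ESSM API: the names `Dim`, `UNITS` and `check_equation` are hypothetical. It also assumes that Equation 3 has the form LE/H = (L / c_pa) · s, and the unit assignments in `UNITS` are guesses chosen to be consistent with the mismatch discussed in the text, not values taken from the paper.

```python
from dataclasses import dataclass

BASE = ("kg", "m", "s", "K")


@dataclass(frozen=True)
class Dim:
    """Exponents of the SI base units kg, m, s and K."""
    kg: int = 0
    m: int = 0
    s: int = 0
    K: int = 0

    def _exps(self):
        return (self.kg, self.m, self.s, self.K)

    def __mul__(self, other):
        # Multiplying quantities adds the exponents of their units.
        return Dim(*(a + b for a, b in zip(self._exps(), other._exps())))

    def __truediv__(self, other):
        # Dividing quantities subtracts the exponents of their units.
        return Dim(*(a - b for a, b in zip(self._exps(), other._exps())))

    def __str__(self):
        parts = [n if e == 1 else f"{n}^{e}"
                 for n, e in zip(BASE, self._exps()) if e]
        return " ".join(parts) or "1"


# Assumed units (informed guesses, not stated in the paper).
UNITS = {
    "LE": Dim(kg=1, s=-3),          # latent heat flux, W m^-2
    "H": Dim(kg=1, s=-3),           # sensible heat flux, W m^-2
    "L": Dim(m=2, s=-2),            # latent heat of vaporisation, J kg^-1
    "c_pa": Dim(m=2, s=-2, K=-1),   # specific heat of air, J kg^-1 K^-1
    "s": Dim(kg=1, m=-3, K=-1),     # slope of saturation curve, kg m^-3 K^-1
}


def check_equation(lhs, rhs):
    """Raise ValueError if the two sides of an equation differ in units."""
    if lhs != rhs:
        raise ValueError(
            f"unit mismatch: left-hand side has units {lhs}, "
            f"right-hand side has units {rhs}"
        )


lhs = UNITS["LE"] / UNITS["H"]
rhs = UNITS["L"] / UNITS["c_pa"] * UNITS["s"]

try:
    check_equation(lhs, rhs)
    message = "units are consistent"
except ValueError as err:
    message = str(err)

print(message)
```

Under these assumptions the check fails. Dividing the right-hand side by a density, `Dim(kg=1, m=-3)`, would make both sides non-dimensional.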

The package returns an error informing us that the left-hand side of the equation is non-dimensional, while the right-hand side has units of $\mathrm{kg\,m^{-3}}$. Clearly, the units of Equation 3 do not match if we use our assumptions about the units of $L$, $s$ and $c_{pa}$. Either the equation is missing a division by a density term (units of $\mathrm{kg\,m^{-3}}$) on the right-hand side, or one of our assumptions about the units involved differed from what the authors had in mind. In any case, if we were not aware of the problem and just substituted values for the symbols in the equation to estimate latent or sensible heat flux, we would likely get a result that has no physical meaning. It is left to the reader to investigate how the Priestley-Taylor equation was interpreted and used in the literature (over 3’000 citations!). An automated extraction and analysis of equations and variable definitions from such a high number of papers is a separate problem that can be tackled with data science methods, but this is outside the scope of this article.

## Become part of the new movement for open and re-usable science!

The scientific community is becoming more and more aware of the advantages of open and re-usable science. (Just search the web for “open science” to get an impression). Whereas many initiatives focus on open data, the initiative presented here focuses on open and re-usable encodings of theory. The general workflow of (re-)producing algebraic derivations in a traceable way and injecting the resulting equations into quantitative computer models has already been used in scientific publications (e.g. Schymanski and Or, 2017; Schymanski, Breitenstein and Or, 2017), which are freely available online, along with the underlying data and code (https://doi.org/10.5281/zenodo.241259, https://doi.org/10.5281/zenodo.241217). The ESSM package is designed to greatly facilitate this approach and provide a blueprint for self-consistent and traceable analysis of quantitative problems involving physical variables. Please try it out and give feedback (bug reports, feature requests, questions) at https://github.com/environmentalscience/essm.

## Co-authors

With intellectual input by:

- Jiri Kuncar
- Eric Bouillet
- Olivier Verscheure
- Dorina Thanou
- Jasmin Pierlorz

## Bibliography

- Clark, M. P., Schaefli, B., Schymanski, S. J., Samaniego, L., Luce, C. H., Jackson, B. M., Freer, J. E., Arnold, J. R., Moore, R. D., Istanbulluoglu, E. and Ceola, S.: Improving the theoretical underpinnings of process-based hydrologic models, Water Resour. Res., 52(3), 2350–2365, doi:10.1002/2015WR017910, 2016.
- Priestley, C. H. B. and Taylor, R. J.: On the Assessment of Surface Heat Flux and Evaporation Using Large-Scale Parameters, Monthly Weather Review, 100(2), 81–92, doi:10.1175/1520-0493(1972)100<0081:OTAOSH>2.3.CO;2, 1972.
- Schymanski, S. J., Breitenstein, D. and Or, D.: Technical note: An experimental setup to measure latent and sensible heat fluxes from (artificial) plant leaves, Hydrol. Earth Syst. Sci. Discuss., 2017, 1–40, doi:10.5194/hess-2016-643, 2017.
- Schymanski, S. J. and Or, D.: Leaf-scale experiments reveal an important omission in the Penman–Monteith equation, Hydrol. Earth Syst. Sci., 21(2), 685–706, doi:10.5194/hess-21-685-2017, 2017.
