Abstract ID: 186
Introducing the AI4S2S project: open-source Python packages to make data-driven pipelines for S2S forecasting more efficient, transparent, and scalable
Lead Author: Sem Vijverberg
Vrije Universiteit, Institute for Environmental Studies, Netherlands
Keywords: open-source software, machine learning, S2S forecasting
Abstract: Reliable subseasonal-to-seasonal (S2S) forecasts remain a major scientific challenge. The lead time is long enough that the memory of the atmosphere’s initial conditions is lost, yet too short for the atmosphere’s boundary conditions to be felt strongly. Skillful forecasts are possible only within specific ‘windows of predictability’ (i.e. specific regions, timescales and climatic background states), in an otherwise largely unpredictable future. Interest in machine learning (ML) is growing fast following a number of successes in S2S forecasting. However, we argue that there is a need for more standardization, consensus on best practices, higher efficiency, and better reproducibility. Typical S2S ML use cases, such as (1) purely statistical forecasting based on observations, (2) transfer learning, and (3) post-processing of dynamical model ensembles, require a large coding and preprocessing effort. Such experiments are not trivial to set up, and without sufficient experience and expertise there is a large risk of improper cross-validation and/or improper and non-standard verification.
Within a 3-year project, a dedicated team of software engineers and researchers is working on lightweight Python packages that make the construction of ML-based pipelines for S2S forecasting much more efficient, transparent, and scalable.
We developed the Python package “lilio” to handle user-defined sequences of precursor and target periods, so that raw input data can be reliably and repeatedly resampled to these periods. The “s2spy” package continues where “lilio” leaves off and facilitates orchestrating full S2S machine learning pipelines, from preprocessing and cross-validation to dimensionality reduction, model fitting and model interpretation. Flexibility is an important pillar of s2spy: users have as much room as possible to insert their own methods for forecasting and/or dimensionality reduction. Once such a clean ML pipeline has been designed, it becomes more transparent and reproducible, as well as easily scalable to any HPC system and climate data platform.
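To make the idea of resampling to precursor and target periods concrete, the following is a minimal sketch in plain pandas. It does not use the lilio or s2spy API; the function name `resample_to_periods`, its parameters, and the synthetic data are all hypothetical and chosen for illustration only. It assumes a daily time series, one anchor date per year, a target period starting at the anchor, and equally long precursor periods directly preceding it.

```python
import numpy as np
import pandas as pd


def resample_to_periods(series, anchor_month, anchor_day,
                        target_days, precursor_days, n_precursors):
    """Average a daily series over one target period and several
    preceding precursor periods, once per anchor year.

    Hypothetical helper for illustration; not part of lilio or s2spy.
    """
    columns = ["anchor_year", "target"] + [
        f"precursor_{i}" for i in range(1, n_precursors + 1)
    ]
    rows = []
    for year in sorted(set(series.index.year)):
        anchor = pd.Timestamp(year=int(year), month=anchor_month, day=anchor_day)
        # Target interval: [anchor, anchor + target_days)
        intervals = {"target": (anchor, anchor + pd.Timedelta(days=target_days))}
        # Precursor intervals count backwards from the anchor;
        # precursor_1 is the period closest to the anchor date.
        for i in range(1, n_precursors + 1):
            end = anchor - pd.Timedelta(days=(i - 1) * precursor_days)
            start = end - pd.Timedelta(days=precursor_days)
            intervals[f"precursor_{i}"] = (start, end)
        # Skip anchor years for which the data do not cover every interval.
        earliest = min(start for start, _ in intervals.values())
        last_day = max(end for _, end in intervals.values()) - pd.Timedelta(days=1)
        if earliest < series.index[0] or last_day > series.index[-1]:
            continue
        row = {"anchor_year": int(year)}
        for name, (start, end) in intervals.items():
            in_period = (series.index >= start) & (series.index < end)
            row[name] = series[in_period].mean()
        rows.append(row)
    return pd.DataFrame(rows, columns=columns).set_index("anchor_year")


# Synthetic daily data: the value is simply the position in the record.
dates = pd.date_range("2019-01-01", "2021-12-31", freq="D")
daily = pd.Series(np.arange(len(dates), dtype=float), index=dates)

# Anchor on 1 July: a 30-day target and two 30-day precursor periods.
periods = resample_to_periods(daily, anchor_month=7, anchor_day=1,
                              target_days=30, precursor_days=30, n_precursors=2)
print(periods)
```

The result has one row per anchor year and one column per period, which is the tabular shape a typical ML model expects. The sketch leaves out details a real tool must handle, such as gaps between periods, anchors on 29 February, and gridded (multi-dimensional) data.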
The AI4S2S project aims to improve reproducibility and works towards a wider acceptance of ML standards and best practices. We will present our vision and the capabilities of our packages, showcasing that a model can be built from raw climate data up to verification in only a few lines of code.
Co-authors:
Bart Schilperoort (Netherlands eScience Center)
Yang Liu (Netherlands eScience Center)
Jannes van Ingen (Vrije Universiteit, Institute for Environmental Studies)
Peter Kalverla (Netherlands eScience Center)
Fakhereh (Sarah) Alidoost (Netherlands eScience Center)
Dim Coumou (Vrije Universiteit, Institute for Environmental Studies)