Machine learning and data assimilation

by Rossella Arcucci

Imagine a world where it is possible to accurately predict the weather, climate, storms, tsunamis and other computationally intensive problems in real time from your laptop or even your mobile phone, and, with access to a supercomputer, to predict at unprecedented scale and detail. This is the long-term aim of our work on Data Assimilation with Machine Learning at the Data Science Institute (Imperial College London, UK) and, as such, we believe it will be a key component of future numerical forecasting systems.

We have shown that integrating machine learning with data assimilation can increase the reliability of predictions, reducing errors by including physically meaningful information from observed data. The resulting combination of machine learning and data assimilation can then be built into a future generation of faster and more accurate predictive models. The integration is based on the idea of using machine learning to learn from the past experience of an assimilation process, which follows the principles of the Bayesian approach.

Edward Norton Lorenz stated that “small causes can have larger effects”, the so-called butterfly effect. Imagine a world where it is possible to catch “small causes” in real time and to predict their effects in real time as well. To know in order to act! A world where science works by continuously learning from observations.

Figure 1. Comparison of the Lorenz system trajectories obtained by the use of Data Assimilation (DA) and by the integration of machine learning with Data assimilation (DA+NN)

Using ‘flood-excess volume’ to quantify and communicate flood mitigation schemes

by Tom Kent

  1. Background

Urban flooding is a major hazard worldwide, brought about primarily by intense rainfall and exacerbated by the built environment we live in. Leeds and Yorkshire are no strangers when it comes to the devastation wreaked by such events. The last decade alone has seen frequent flooding across the region, from the Calder Valley to the city of York, while the Boxing Day floods in 2015 inundated central Leeds with unprecedented river levels recorded along the Aire Valley. The River Aire originates in the Yorkshire Dales and flows roughly eastwards through Leeds before merging with the Ouse and Humber rivers and finally flowing into the North Sea. The Boxing Day flood resulted from record rainfall in the Aire catchment upstream of Leeds. To make matters worse, near-record rainfall in November meant that the catchment was severely saturated and prone to flooding in the event of more heavy rainfall. The ‘Leeds City Region flood review’ [1] subsequently reported the scale of the damage: “Over 4,000 homes and almost 2,000 businesses were flooded with the economic cost to the City Region being over half a billion pounds, and the subsequent rise in river levels allowed little time for communities to prepare.”

The Boxing Day floods and the lack of public awareness around the science of flooding led to the idea and development of the flood-demonstrator ‘Wetropolis’ (see Onno Bokhove’s previous DARE blog post). Wetropolis is a tabletop model of an idealised catchment that illustrates how extreme hydroclimatic events can cause a city to flood due to peaks in groundwater and river levels following random intense rainfall, and in doing so conceptualises the science of flooding in a way that is accessible to and directly engages the public. It also provides a scientific testing environment for flood modelling, control and mitigation, and data assimilation, and has inspired numerous discussions with flood practitioners and policy makers.

These discussions led us in turn to reconsider and analyse river flow data as a basis for assessing and quantifying flood events and various potential and proposed flood-mitigation measures. Such measures are generally engineering-based (e.g., storage reservoirs, defence walls) or nature-based (e.g., tree planting and peat restoration, ‘leaky’ woody-debris dams); a suite of these different measures constitutes a catchment- or city-wide flood-mitigation scheme. We aim to communicate this analysis and resulting flood-mitigation assessment in a concise and straightforward manner in order to assist decision-making for policy makers (e.g., city councils and the Environment Agency) and inform the general public.

  2. River data analysis and ‘flood-excess volume’

Rivers in the UK are monitored by a dense network of gauges that measure and record the river level (also known as water stage/depth) – typically every 15 minutes – at the gauge location. There are approximately 1500 gauging stations in total, and the flow data are collated by the Environment Agency and freely available to download. Shoothill’s GaugeMap website provides an excellent tool for visualising these data in real time and browsing historic data in a user-friendly manner. Flood events are often characterised by their peak water level, i.e. the maximum water depth reached during the flood, and statistical return periods. However, this flood peak conveys neither the duration nor the volume of the flood, and the meaning of return period is often difficult to grasp for non-specialists. Here, we analyse river-level data from the Armley gauge station – located 2 km upstream from Leeds city centre – and demonstrate the concept of ‘flood-excess volume’ as an alternative diagnostic for flood events.

The bottom-left panel of Figure 1 (it may help to tilt your head left!) shows the river level (h, in metres) as a function of time in days around Boxing Day 2015. The flood peaked at 5.21 m overnight on the 26th/27th December, rising over 4 m in just over 24 hours. Another quantity of interest in hydrology is the discharge (Q), or flow rate: the volume of water passing a location per second. This is usually not measured directly but can be determined via a rating curve, a site-specific empirical function Q = Q(h) that relates the water level to discharge. Each gauge station has its own rating curve, which is documented and updated by the Environment Agency. The rating curve for Armley is plotted here in the top-left panel (solid curve) with the dashed line denoting its linear approximation; the shaded area represents the estimated error in the relationship, which is expected to grow considerably when in flood (i.e., for high values of h). Applying the rating curve to the river level data yields the discharge time series (top-right panel, called a hydrograph) for Armley. Note that the rating curve error means that the discharge time series has some uncertainty (grey shaded zone around the solid curve). We see that the peak discharge is 330–360 m³/s, around 300 m³/s higher than 24 hours previously. Since discharge is the volume of water per second, the area under the discharge curve is the total volume of water.

To define the flood-excess volume, we introduce a threshold height hT above which flooding occurs. For this flood event, local knowledge and photographic evidence suggested that flooding commenced when river levels exceeded 3.9 m, so here we choose the threshold hT = 3.9 m. This is marked as a vertical dotted line on the left panels: following it up to the rating curve, one obtains a threshold discharge QT = Q(hT) = 219.1 m³/s (horizontal dotted line). The flood-excess volume (FEV) is the blue shaded area between the discharge curve and the threshold discharge QT. Put simply, this is the volume of water that caused flooding, and therefore the volume of flood water one seeks to mitigate (i.e., reduce to zero) by the cumulative effect of various flood-mitigation measures. The FEV, here around 9.34 million cubic metres, has a corresponding flood duration Tf = 32 hours, which is the time between the river level first exceeding hT and subsequently dropping below hT. The rectangle represents the mean approximation to the FEV, which, in the absence of frequent flow data, can be used to estimate the FEV (blue shaded area) based on a mean water level (hm) and discharge (Qm).
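The definition above can be sketched numerically. The snippet below is a minimal illustration rather than the analysis used for Figure 1: the hydrograph is synthetic (a smooth peak sampled every 15 minutes, not the Armley record), the function name `flood_excess_volume` is hypothetical, and only the threshold QT = 219.1 m³/s is taken from the text.

```python
import numpy as np

def flood_excess_volume(t_seconds, discharge, q_threshold):
    """Integrate the discharge above a threshold with the trapezoidal rule.

    Returns (FEV in cubic metres, flood duration in seconds). The duration
    is only resolved to the sampling interval of the time series.
    """
    t = np.asarray(t_seconds, dtype=float)
    q = np.asarray(discharge, dtype=float)
    # Discharge in excess of the threshold; zero when the river is not in flood.
    excess = np.clip(q - q_threshold, 0.0, None)
    dt = np.diff(t)
    fev = float(np.sum(0.5 * (excess[1:] + excess[:-1]) * dt))
    # Count the intervals whose two end points both exceed the threshold.
    in_flood = (excess[1:] > 0.0) & (excess[:-1] > 0.0)
    duration = float(np.sum(dt[in_flood]))
    return fev, duration

# Synthetic hydrograph (assumption): base flow plus a smooth peak, 15-minute sampling.
t = np.arange(0.0, 4 * 24 * 3600.0, 15 * 60.0)
q = 60.0 + 285.0 * np.exp(-(((t - 2 * 24 * 3600.0) / (12 * 3600.0)) ** 2))

fev, duration = flood_excess_volume(t, q, q_threshold=219.1)
print(f"FEV = {fev / 1e6:.2f} Mm^3 over {duration / 3600.0:.1f} hours")
```

With real gauge data, the discharge array would come from applying the station's rating curve to the recorded river levels, and the rating-curve uncertainty would carry through to the FEV.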

  3. Using FEV in flood-mitigation assessment

Having defined FEV in this way, we are motivated by the following questions: (i) how can we articulate FEV (which is often many million cubic metres) in a more comprehensible manner? And (ii) what fraction of the FEV is reduced, and at what cost, by a particular flood-mitigation measure? Our simple yet powerful idea is to express the FEV as a 2-metre-deep square ‘flood-excess lake’ with side-length on the order of a kilometre. For example, we can break down the FEV for Armley as follows: 9.34 Mm³ ≈ (2150² × 2) m³, which is a 2-metre-deep lake with side-length 2.15 km. This is immediately easier to visualise and goes some way to conveying the magnitude of the flood. Since the depth is shallow relative to the side-length, we can view this ‘flood-excess lake’ from above as a square and ask what fraction of this lake is accounted for by the potential storage capacity of flood-mitigation measures. The result is a graphical tool that (i) contextualises the magnitude of the flood relative to the river and its valley/catchment and (ii) facilitates quick and direct assessment of the contribution and value of various mitigation measures.
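The conversion from a volume to a ‘flood-excess lake’ is a one-line calculation. A minimal sketch, in which the function name `lake_side_length` is hypothetical and the 2-metre depth is the one chosen in the text:

```python
import math

def lake_side_length(fev_m3, depth_m=2.0):
    """Side length in metres of a square lake of the given depth holding fev_m3."""
    return math.sqrt(fev_m3 / depth_m)

side = lake_side_length(9.34e6)  # Armley FEV from the text, in cubic metres
print(f"side length = {side / 1000.0:.2f} km")
```

For the Armley FEV this gives a side length of about 2.16 km, in line with the rounded 2.15 km quoted above.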

Figure 2 shows the Armley FEV as a 2m-deep ‘flood-excess lake’ (not to scale). Given the size of the lake as well as the geography of the river valley concerned, one can begin to make a ballpark estimate of the contribution and effectiveness of flood-plain enhancement for flood storage and other flood-mitigation measures. Superimposed on the bird’s-eye view of the lake in figure 3 are two scenarios from our hypothetical Leeds Flood Alleviation Scheme II (FASII+) that comprise: (S1) building flood walls and using a flood-water storage site at Calverley; and (S2) building (lower) flood walls and using a flood-water storage site at Rodley.

The available flood-storage volume is estimated to be 0.75 Mm³ and 1.1 Mm³ at Calverley and Rodley respectively, corresponding to 8% and 12% of the FEV. The absolute cost of each measure is incorporated, as well as the value (i.e., cost per 1% of FEV mitigated), while the overall contribution in terms of volume is simply the fraction of the lake covered by each measure. It is immediately evident that both schemes provide 100% mitigation and that (S1) provides better value (£0.75M/1% against £0.762M/1%). We can also see that, although storage sites offer less value than building flood walls, a larger storage site allows lower flood walls to be built, which may be an important factor for planning departments. In this case, although (S2) is more expensive overall, the Rodley storage site (£1.17M/1%) is better value than the Calverley storage site (£1.25M/1%) and means that flood walls are lower. It is then up to policy-makers to make the best decision based on all the available evidence and inevitable constraints. Our hypothetical FASII+ comprises 5 scenarios in total and is reported in [2].
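The percentages and ‘value’ figures above follow from simple ratios. The sketch below uses only numbers quoted in the text; the function names are hypothetical, and the total scheme costs it prints are a back-of-envelope inversion of the quoted value figures, not costs taken from [2].

```python
FEV_M3 = 9.34e6  # Armley flood-excess volume from the text

def percent_of_fev(volume_m3, fev_m3=FEV_M3):
    """Share of the flood-excess volume covered by a storage volume, in percent."""
    return 100.0 * volume_m3 / fev_m3

def value_per_percent(cost_million_gbp, percent_mitigated):
    """Cost in millions of pounds per 1% of FEV mitigated."""
    return cost_million_gbp / percent_mitigated

storage_m3 = {"Calverley": 0.75e6, "Rodley": 1.1e6}
for site, volume in storage_m3.items():
    print(f"{site}: {percent_of_fev(volume):.1f}% of the FEV")

# Both schemes mitigate 100% of the FEV, so the quoted value implies a total cost.
for scheme, value in {"S1": 0.75, "S2": 0.762}.items():
    print(f"{scheme}: implied total cost of about {value * 100.0:.1f} million pounds")
```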

The details are in some sense of secondary importance here; the take-home message is that the FEV analysis offers a protocol to optimise the assessment of mitigation schemes, including cost-effectiveness, in a comprehensible way. In particular, the graphical presentation of the FEV as partitioned flood-excess lakes facilitates quick and direct interpretation of competing schemes and scenarios, and in doing so communicates clearly the evidence needed to make rational and important decisions. Finally, we stress that FEV should be used either prior to or in tandem with more detailed hydrodynamic numerical modelling; nonetheless it offers a complementary way of classifying flood events and enables evidence-based decision-making for flood-mitigation assessment. For more information, including case studies in the UK and France, see [2,3,4]; summarised in [5].


[1] West Yorkshire Combined Authority (2016): Leeds City Region flood review report. December 2016.

[2] O. Bokhove, M. Kelmanson, T. Kent (2018a): On using flood-excess volume in flood mitigation, exemplified for the River Aire Boxing Day Flood of 2015. Subm. evidence-synthesis article: Proc. Roy. Soc. A. See also:

[3] O. Bokhove, M. Kelmanson, T. Kent, G. Piton, J.-M. Tacnet (2018b): Communicating nature-based solutions using flood-excess volume for three UK and French river floods. In prep. See also the preliminary version on:

[4] O. Bokhove, M. Kelmanson, T. Kent (2018c): Using flood-excess volume in flood mitigation to show that upscaling beaver dams for protection against extreme floods proves unrealistic. Subm. evidence-synthesis article: Proc. Roy. Soc. A. See also:

[5] ‘Using flood-excess volume to assess and communicate flood-mitigation schemes’, poster presentation for ‘Evidence-based decisions for UK Landscapes’, 17-18 September 2018, INI, Cambridge. Available here:

Workshop on Sensitivity Analysis and Data Assimilation in Meteorology and Oceanography

by Fabio L. R. Diniz

I attended the Workshop on Sensitivity Analysis and Data Assimilation in Meteorology and Oceanography, also known as the Adjoint Workshop, which took place in Aveiro, Portugal between 1st and 6th July 2018. This opportunity was given to me thanks to funding for early career researchers from the Engineering and Physical Sciences Research Council (EPSRC) Data Assimilation for the Resilient City (DARE) project in the UK. All recipients of this fund who were participating in the workshop for the first time were invited to attend the pre-workshop day of tutorials, which presented sensitivity analysis and data assimilation fundamentals geared to early career researchers. I would like to thank the EPSRC DARE award committee and the organizers of the Adjoint Workshop for finding me worthy of this award.

Currently I’m a postgraduate student at the Brazilian National Institute for Space Research (INPE) and have been visiting the Global Modeling and Assimilation Office (GMAO) of the American National Aeronautics and Space Administration (NASA) for almost one year as part of my PhD, comparing two approaches to obtain what is known as the observation impact measure. This measure is obtained as a direct application of sensitivity in data assimilation and is basically a measure of how much each observation helps to improve short-range forecasts. In meteorology, specifically in numerical weather prediction, these observations are represented by the global observing system, which includes a number of in situ observations (e.g., radiosondes and surface observations) and remotely sensed observations (e.g., satellite sensors). During my visit, I’ve been working under the supervision of Ricardo Todling from NASA/GMAO, comparing results from two strategies for assessing the impact of observations on forecasts using the data assimilation system available at NASA/GMAO: one based on the traditional adjoint technique, another based on ensembles. Preliminary results from this comparison were presented during the Adjoint Workshop.

The Adjoint Workshop provided a perfect environment for early career researchers to interact with experts in the field from all around the world. Attending the workshop has helped me engage in healthy discussions about my work and data assimilation in general. The full programme with abstracts and presentations is available at the workshop web site:

Thanks to everyone who contributed to this workshop.

Accounting for Unresolved Scales Error with the Schmidt-Kalman Filter at the Adjoint Workshop

by Zak Bell

This summer I was fortunate enough to receive funding from the DARE training fund to attend the 11th workshop on sensitivity analysis and data assimilation in meteorology and oceanography. This workshop, also known as the adjoint workshop, provides academics and students with an occasion to present their research on the inclusion of Earth observations into mathematical models. Thanks to the friendly environment of the workshop, I had an excellent opportunity to condense a portion of my research into a poster and discuss it with other attendees.

Data assimilation is essentially a way to link theoretical models of the world to the actual world. This is achieved by finding the most likely state of a model through observations of it. A state for numerical weather prediction will typically comprise variables such as wind, moisture and temperature at a specific time. One way to assimilate observations is through the Kalman Filter. The Kalman Filter assimilates one observation at a time. By taking into account the errors of our models, computations and observations, we can determine the most probable state of our model and use this state to better model or forecast the real world.
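To make the idea concrete, here is a minimal sketch of a Kalman Filter update for a single scalar variable that is observed directly. The function name `kalman_update` and all numbers are illustrative assumptions; a weather-prediction state has many variables and a forecast model that evolves the state between observations.

```python
def kalman_update(x_prior, p_prior, y_obs, r_obs):
    """One scalar Kalman update for a directly observed state.

    x_prior, p_prior: prior state estimate and its error variance
    y_obs, r_obs:     observation and its error variance
    """
    # The gain weighs trust in the prior against trust in the observation.
    gain = p_prior / (p_prior + r_obs)
    x_post = x_prior + gain * (y_obs - x_prior)
    p_post = (1.0 - gain) * p_prior
    return x_post, p_post

# Assimilate made-up temperature observations one at a time.
x, p = 10.0, 4.0  # prior estimate (degrees C) and its variance
for y in [12.1, 11.8, 12.4]:
    x, p = kalman_update(x, p, y, r_obs=1.0)
    print(f"estimate = {x:.2f}, variance = {p:.3f}")
```

Each update pulls the estimate towards the observation and reduces its variance; a larger observation error variance `r_obs` makes the filter trust the observation less.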

It goes without saying that a better understanding of the errors involved in the observations would lead to a better forecast. Therefore, research into observation errors is a large and ongoing area of interest. My research is on observation error due to unresolved scales in data assimilation which can be broadly described as the difference between what an observation actually observes and a numerical model’s representation of that observation. For example, an observation taken in a sheltered street of a city will have a different value than a numerical model of that city unable to individually represent the spatial scales of each street. To utilize such observations within data assimilation, the unresolved spatial scales must be accounted for in some way.  The method I chose to create a poster for was the Schmidt-Kalman Filter which was originally developed for navigation purposes but has since been the subject of a few studies within the meteorology community on unresolved scales error.

The Schmidt-Kalman Filter accounts for the state- and time-dependence of the error due to unresolved scales through use of the statistics of the unresolved scales. However, to save on computational expense, the unresolved state values will be disregarded. My poster presented a mathematical analysis of a simple example for the Schmidt-Kalman Filter and highlighted its ability to compensate for unresolved scales error. The Schmidt-Kalman filter performs better than a Kalman Filter for just the resolved scales but worse than a Kalman Filter that resolves all scales which is to be expected. Using the feedback from the other attendees and ideas obtained from other presentations at the workshop I will continue to investigate the properties of the Schmidt-Kalman Filter as well as its suitability for urban weather prediction.
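The idea can be sketched in a few lines. The code below is a generic textbook form of a Schmidt-Kalman update, not the analysis presented on the poster: the function name, the variable names and the numbers are assumptions. The covariance covers both resolved and unresolved parts, but the gain for the unresolved part is set to zero so that its values are never updated.

```python
import numpy as np

def schmidt_kalman_update(x, b, P, y, Hx, Hb, R):
    """Update the resolved state x; leave the unresolved part b untouched.

    P is the joint error covariance of (x, b), ordered with x first.
    Hx and Hb map the resolved and unresolved parts to observation space.
    """
    nx = x.size
    H = np.hstack([Hx, Hb])
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # gain for the full (x, b) state
    K[nx:, :] = 0.0                      # Schmidt step: never update b
    innovation = y - Hx @ x - Hb @ b
    x_new = x + K[:nx] @ innovation
    # Joseph form, which stays valid for a gain that is not the optimal one.
    I_KH = np.eye(P.shape[0]) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ R @ K.T
    return x_new, P_new

# One resolved and one unresolved scalar, both contributing to the observation.
x_new, P_new = schmidt_kalman_update(
    x=np.array([0.0]), b=np.array([0.0]), P=np.eye(2),
    y=np.array([3.0]), Hx=np.array([[1.0]]), Hb=np.array([[1.0]]),
    R=np.array([[1.0]]),
)
print(x_new)
print(P_new)
```

In this example the variance of the unresolved part is left unchanged by the update, while the resolved variance shrinks by less than it would in a filter that ignored the unresolved scales altogether.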

Producing the best weather forecasts by using all available sources of information

Jemima M. Tabeart is a PhD student at the University of Reading in the Mathematics of Planet Earth Centre for Doctoral Training. She has received funding from the DARE training fund to attend Data Assimilation tutorials at the Workshop on Sensitivity Analysis and Data Assimilation in Meteorology and Oceanography, 1-6 July 2018, Aveiro, Portugal. Here she writes about her research work.

In order to produce the best weather forecast possible, we want to make use of all available sources of information. This means combining observations of the world around us at the current time with a computer model that can fill in the gaps where we have no observations, by using known laws of physics to evolve observations from the past. This combination process is called data assimilation, and our two data sources (the model and observations) are weighted by our confidence in how accurate they are. This means that knowledge about errors in our observations is really important for getting good weather forecasts. This is especially true where we expect errors between different observations to be related, or correlated.

Caption: An image of the satellite MetOp-B which hosts IASI (Infrared Atmospheric Sounding Interferometer) – an instrument that I have been using as an example to test new mathematical techniques to allow correlated errors to be used inexpensively in the Met Office system.  Credit: ESA AOES Medialab MetOp-B image.

Why do such errors occur? No observation will be perfect: there might be biases (e.g. a thermometer that measures everything 0.5℃ too hot); we might be measuring quantities that are not the variables used in a numerical model, so that converting the observations introduces an error (this is the case with satellite observations); and we might be using high-density observations that can detect phenomena that our model cannot (e.g. intense localised rainstorms might not show up if our model can only represent objects larger than 5 km). Including additional observation error correlations means we can use observation data more intelligently and even extract extra information, leading to improvements in forecasts.

However, these observation error correlations cannot be calculated directly – we instead have to estimate them. Including these estimates in our computations is very expensive, so we need to find ways of including this useful error information in a way that is cheap enough to produce new forecasts every 6 hours! I research mathematical techniques to adapt error information estimates for use in real-world systems.

Caption: Error correlation information for IASI instrument. Dark colours indicate stronger relationships between errors for different channels of the instrument – often strong relationships occur between variables that measure similar things. We want to keep this structure, but change the values in a way that makes sure our computer system still runs quickly.

At the workshop I’ll be presenting new work that tests some of these methods using the Met Office system. Although we can improve the time required for our computations, using different error correlation information alters other parts of the system too! As we don’t know “true” values, it’s hard to know whether these changes are good, bad or just different. I’m looking forward to talking with scientists from other organisations who understand this data and can provide insight into what these differences mean. Additionally, as these methods are already being used to produce forecasts at meteorological centres internationally, discussions about the decision process and impact of different methods are bound to be illuminating!

DARE first field trip to Tewkesbury

The Dare team went on a field trip last month! It was a well planned and executed trip – as you would expect from a group of mathematicians. It was also a very interesting trip for us, since most of us have only ever used data (e.g. for improving forecasts), not collected it. Even better, the Tewkesbury area has become a sort of benchmark for testing new data assimilation methods, ideas, tools, observations, etc., and so many of us have worked with the LisFlood numerical model (developed by a team led by Prof. Paul Bates at the University of Bristol) over the Tewkesbury domain. We have seen the rivers run in the model outputs, watched the rivers Avon and Severn go over their banks in our plots, and investigated various SAR images of the area, but we had never been to the area. We generally do not need to visit the area when working with the models; however, now that there was a chance to do so, it was no surprise that many of us were keen to go. And we did go like ‘d’ A-team:

Figure 1. ‘d’A-team have arrived!


However, we had a more important reason for visiting too – we were going to the Tewkesbury area to collect metadata from a number of river cameras located near Tewkesbury town. These river cameras are high-definition webcams owned and serviced by Farson Digital Ltd at various locations across the UK. We had recently discovered that six such cameras are within the LisFlood model domain and captured the November 2012 floods in the area. With permission from Farson Digital Ltd, we have obtained hourly daylight images of the floods from 21st November 2012 to 5th December 2012. Hence, the aim of our trip was to obtain accurate (with errors of no more than a few centimetres) positional information (i.e. latitude, longitude, height) of the cameras themselves, as well as the positional information of a number of markers in the images for each of the cameras. We need this information to extract water extents and water depths from these images as accurately as possible using image processing tools (which we are currently working on).

Figure 2. Rivers Avon and Severn domain map in the LisFlood model with the six river cameras located where the red/white circles are positioned.


To take these measurements we had borrowed some tools from the Department of Geography at the University of Reading. We used a differential GPS tool (GNSS) to measure very accurately (to within a few centimetres) the position of a given point in 3D space, that is, its latitude, longitude and height above sea level. However, it had to be used on the ground (e.g. it could not measure remote or high points such as the building corners where some cameras were mounted) and could not be too close to buildings or large trees. To measure remote and high points we used a Total Station, which allowed us to shoot a laser beam at the desired point to measure its 3D position in space.

Figure 3. Steve Edgar from the Environment Agency giving a masterclass with the Total Station.


We had planned to visit all six cameras within the two days of 16th and 17th April; however, despite our best plans and fantastic organisation skills, we were too ambitious with our time and had to drop the camera furthest from our base, the Bewdley camera (see the map with camera positions in Figure 2). Thus, on our first day, we took measurements from the Wyre Piddle, Evesham and Diglis Lock cameras, spotting ourselves live on the Farson Digital Ltd site.

Figure 4. Dare team spotted at the Evesham site live on Farson Digital Ltd river cameras on 16th of April.


We returned to our base, the Tewkesbury Park Hotel, to be joined by the Ensemble team from Lancaster University. The Ensemble project is led by Prof. Gordon Blair and, like Dare, is funded by an EPSRC Senior Fellowship in Digital Technology for Living with Environmental Change. It was very interesting to meet the Ensemble project team and learn in more depth about their work, future interests, and scope for collaboration.

Figure 5. Prof. Gordon Blair (University of Lancaster) giving an overview of the Ensemble project and introducing his team.


On our second day, the Dare team visited the Tewkesbury camera while the Ensemble team learned more about the purpose of the data collection and the November 2012 floods in the area. Then we all jointly measured a large number of points at the Strensham Lock. In 2012 we would all have been totally submerged in water in this picture, since the flood waters completely swallowed the island on which the house is standing, flooding the building along with it.

Figure 6. Dare and Ensemble project teams at the Strensham Lock.


Our grand finale was the meeting with Glyn Howells, the director of Farson Digital Ltd, as well as a number of stakeholders who have commissioned the cameras we visited. It was very interesting for us to learn how the network of river cameras was born from the need to know and understand the current state of the river for a variety of river users – fishermen, campers, boaters, etc. We also learned how these cameras have become invaluable assets to many stakeholders for various reasons – greatly reducing the number of phone enquiries about river conditions, monitoring river bank and bridge conditions, and so on.

Now, a month later, we have downloaded and processed the data we collected from these stations. In Figure 7 we have plotted the data points we took at the Tewkesbury site, owned by both the Environment Agency and Tewkesbury Marina (we greatly thank both for their support and assistance before and during our trip, especially Steve Edgar from the EA and Simon Amos and Bruno from Tewkesbury Marina). In the figure, the red dots are the camera positions (pre-2016 and current), and the black dots are all the other measurements we took using both the Total Station and GNSS tools, plotted against the Environment Agency lidar data with 1m horizontal resolution.

Figure 7. Locations of the point measurements we took in Tewkesbury town with both the Total Station and GNSS. Red dots are camera locations (pre-2016 and current positions); black dots are measurements of various reference points that can be seen from the camera. To make this image we used the Environment Agency 1m horizontal resolution lidar data and the LandSerf open source GIS software.


We are currently working on extracting the water extent from these images, which we will then use to produce water depth observations. Our final aim is to see how much forecast improvement such a rich source of observations offers, in particular before the rising limb of the flood.

We are very thankful to Glyn Howells and the various stakeholders for permitting us to use the images, allowing us to take the necessary measurements, assisting us on the sites, and joining us at the workshop!

Can we use future data to improve our knowledge of the ocean?

by Chris Thomas

An interesting problem in climate science is working out what happened in the world’s oceans in the last century. How did the temperature change, where were the currents strongest, and how much ice was there at the poles? These questions are interesting for many reasons, including the fact that most global warming is thought to be occurring in the oceans and learning more about when and where this happened will be very useful for both scientists and policymakers.

There are several ways to approach the problem. The first, and maybe the most obvious, is to use the observations that were recorded at the time. For example, there are measurements of the sea surface temperature spanning the entire last century. These measurements were made by (e.g.) instruments carried on ships, buoys drifting in the ocean, and (in recent decades) satellites. This approach is the most direct use of the data, and arguably the purest way to determine what really happened. However, particularly in the ocean, the observations can be thinly scattered, and producing a complete global map of temperature requires making various assumptions which may or may not be valid.

The second approach is to use a computer model. State-of-the-art models contain a huge amount of physics and are typically run on supercomputers due to their size and complexity. Models of the ocean and atmosphere can be guided using our knowledge of factors such as the amount of CO2 in the atmosphere and the intensity of solar radiation received by the Earth. Although contemporary climate models have made many successful predictions and are used extensively to study climate phenomena, the precise evolution of an individual model run will not necessarily reproduce reality particularly closely due to the random variation which often occurs.

The final technique is to try to combine the first two approaches in what is known as a reanalysis. The process of reanalysis involves taking observations and combining them with climate models in order to work out what the climate was doing in the past. Large-scale reanalyses usually cover multiple decades of observations. The aim is to build up a consistent picture of the evolution of the climate, using observations to modify the evolution of the model in an optimal way. Reanalyses can yield valuable information about the performance of models (enabling them to be tuned), explore aspects of the climate system which are difficult to observe, explain various observed phenomena, and aid predictions of the future evolution of the climate system. That’s not to say that reanalyses don’t have problems, of course; a common criticism is that various physical parameters are not necessarily conserved (which can happen if the model and observations are radically different). Even so, many meteorological centres around the world have conducted extensive reanalyses of climate data. Examples of recent reanalyses include GloSea5 (Jackson et al. 2016), CERA-20C, MERRA-2 (Gelaro et al. 2017) and JRA-55 (Kobayashi et al. 2015).

When performing a reanalysis, the observations are typically divided into consecutive “windows” spanning a few days. The model starts at the beginning of the first window and runs forward in time. The reanalysis procedure pushes the model trajectory towards any observations that fall in each window; the amount by which the model is moved depends on how much confidence we have in the model relative to the observations. A very simplified schematic of the procedure can be found in Figure 1.

Figure 1: A very simplified schematic of how reanalysis works. The data (black stars) are divided into time windows indicated by the vertical lines. The model, if left to its own devices, would take the blue trajectory. If the data are used in conjunction with the model it follows the orange trajectory.
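The weighting idea can be illustrated with a minimal sketch. It assumes a single scalar state, one observation at the end of each window, and known error variances for the model and the observations; operational reanalyses use far more sophisticated schemes. The function names and the toy model (`toy_step`, which simply decays towards zero) are hypothetical and chosen only for illustration.

```python
def analysis_update(forecast, observation, model_var, obs_var):
    """Blend a model forecast with an observation, weighted by their error variances."""
    # gain near 0: trust the model; gain near 1: trust the observation
    gain = model_var / (model_var + obs_var)
    return forecast + gain * (observation - forecast)


def run_windows(x0, step, window_obs, window_len, model_var, obs_var):
    """Run the model window by window, nudging it at the end of each window.

    window_obs holds one observed value per window, or None if nothing was observed.
    """
    x = x0
    trajectory = [x]
    for obs in window_obs:
        for _ in range(window_len):
            x = step(x)
            trajectory.append(x)
        if obs is not None:
            x = analysis_update(x, obs, model_var, obs_var)
            trajectory[-1] = x  # the corrected state starts the next window
    return trajectory


def toy_step(x):
    """Hypothetical toy model: the state decays by 10% per time step."""
    return 0.9 * x


# The "truth" is assumed to stay at 1.0, so every observation equals 1.0.
free_run = run_windows(1.0, toy_step, [None, None, None], 5, 1.0, 1.0)
guided_run = run_windows(1.0, toy_step, [1.0, 1.0, 1.0], 5, 1.0, 1.0)
```

Here `free_run` plays the role of the blue trajectory in Figure 1 and `guided_run` the orange one: with equal variances the model is moved halfway towards each observation, and a smaller `obs_var` would pull it closer still.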

This takes us to the title of the post. Obviously it’s not actually possible to use data from the future (without a convenient time machine), but the nice aspect of a reanalysis is that all of the data are available for the entire run. Towards the start of the run we have knowledge of the observations in the “future”; if we believe these observations will enable us to push the current model closer to reality it is desirable for us to use them as effectively as possible. One way to do that would be to extend the length of the windows, but that eventually becomes computationally unfeasible (even with the incredible supercomputing power available these days).

The question, therefore, is whether we can use data from the “future” to influence the model at the current time, without having to extend the window to unrealistic lengths. The methodology to do this has been introduced in our paper (Thomas and Haines 2017). The essential idea is to use a two-stage procedure. The first stage is a standard reanalysis which incorporates all data except the observations that appear in the future. The second stage then uses the future data to modify the trajectory again. Two stages are required because the key quantity of interest is the offset between the future observations and the first trajectory; without this, we’d just be guessing how the model would behave and would not be able to exploit the observations as effectively.
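To make the role of the offset concrete, here is a rough scalar sketch of a second-stage correction. It is not the formulation in the paper: it assumes a single state variable, a known covariance between the state now and the state at the later time (`lagged_cov`), and known variances. All names are hypothetical.

```python
def second_stage_update(stage1_now, stage1_future, future_obs,
                        lagged_cov, future_var, obs_var):
    """Correct the present state using an observation from a later time.

    stage1_now    : first-stage estimate at the current time
    stage1_future : first-stage estimate at the time of the future observation
    future_obs    : the observation made at that later time
    lagged_cov    : assumed covariance between the present and future state errors
    future_var    : assumed error variance of the first-stage estimate at the later time
    obs_var       : assumed error variance of the observation
    """
    # The offset is only available once the first stage has been run.
    offset = future_obs - stage1_future
    # The lagged covariance carries the correction back in time.
    gain = lagged_cov / (future_var + obs_var)
    return stage1_now + gain * offset


corrected = second_stage_update(stage1_now=1.0, stage1_future=2.0, future_obs=3.0,
                                lagged_cov=0.5, future_var=1.0, obs_var=1.0)
unchanged = second_stage_update(stage1_now=1.0, stage1_future=2.0, future_obs=3.0,
                                lagged_cov=0.0, future_var=1.0, obs_var=1.0)
```

In this sketch, if the present and future states are uncorrelated (`lagged_cov` is zero), the future observation has no influence, which is why a physically meaningful lagged relationship is needed for the approach to help.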

Our paper describes a test of the method using a simple system: a sine-wave shape travelling around a ring. Observations are generated at different locations and the model trajectory is modified accordingly. It is found that including the future observations improves the description relative to the first stage; some results are shown in Figure 2. The method has been tested in a variety of situations (including different models) and is reasonably robust even when the model varies considerably through time.

Figure 2: Results obtained when using the new methodology in a simple simulation. The left hand plot shows the results after the first stage, and the right hand plot shows the results after the second stage. In each plot the horizontal axis is space and the vertical axis is time. Values closer to zero (the white areas) indicate the procedure has performed well, whereas values further from zero (the blue and orange areas) indicate it has not been as successful. The second stage has more white areas, showing an improvement over the first. (The labels at the top of each plot indicate where the observations are located.)
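For readers who want to experiment, a test bed of this kind (a sine wave travelling around a ring, sampled at a few locations) could be set up along the following lines. This is only a sketch under simple assumptions: the ring is a periodic grid, the wave moves exactly one grid cell per time step, and the observations are noise-free. It does not reproduce the configuration used in the paper, and all names are hypothetical.

```python
import math


def sine_on_ring(n_points, shift=0):
    """One full sine wave sampled on a periodic ring, displaced by `shift` grid cells."""
    return [math.sin(2 * math.pi * (i - shift) / n_points) for i in range(n_points)]


def advect(state, cells=1):
    """Move the pattern `cells` grid points around the ring (periodic boundary)."""
    cells %= len(state)
    if cells == 0:
        return list(state)
    return state[-cells:] + state[:-cells]


def observe(state, locations):
    """Sample the state at the given grid indices (noise-free observations)."""
    return {loc: state[loc] for loc in locations}


n = 40
truth = sine_on_ring(n)
for _ in range(10):          # let the wave travel 10 cells around the ring
    truth = advect(truth)
obs = observe(truth, [0, 10, 20])
```

Observation noise and a deliberately imperfect model (for example, one with the wrong propagation speed) would need to be added before such a set-up could be used to compare the two stages.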

We have now implemented the method in a large-scale ocean reanalysis which is currently running on a supercomputer. We are particularly interested in a process known as the AMOC (Atlantic Meridional Overturning Circulation), which is a north-south movement of water in the Atlantic Ocean (see Figure 3 for a cartoon). It is believed that the behaviour of water in the northernmost reaches of the Atlantic can influence the strength of circulation in the tropical latitudes; crucially, this relationship is strongest at a time lag of several years (Polo et al. 2014). Data collected by the RAPID measurement array in the North Atlantic take the role of the “future” data in the reanalysis and are used to modify the model trajectory in the North Atlantic. The incorporation of RAPID data in this way has not been done before and we’re looking forward to the results!

Figure 3: A cartoon of the AMOC and the RAPID array in the North Atlantic (adapted from an external source). The red and blue curves indicate the movement of water. The yellow circles indicate roughly where the RAPID array is located.


Jackson, L. C., Peterson, K. A., Roberts, C. D. and Wood, R. A. 2016. Recent slowing of Atlantic overturning circulation as a recovery from earlier strengthening. Nat. Geosci. 9(7), 518–522.

Kobayashi, S. et al. 2015. The JRA-55 Reanalysis: General specifications and basic characteristics. J. Meteor. Soc. Japan Ser. II 93(1), 5–48.

Gelaro, R. et al. 2017. The Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2). J. Clim. 30(14), 5419–5454.

Thomas, C. M. and Haines, K. 2017. Using lagged covariances in data assimilation. Accepted for publication in Tellus A.

Polo, I., Robson, J., Sutton, R. and Balmaseda, M. A. 2014. The Importance of Wind and Buoyancy Forcing for the Boundary Density Variations and the Geostrophic Component of the AMOC at 26°N. J. Phys. Oceanogr. 44(9), 2387–2408.