FIDUCEO wanted an approach to harmonisation that was metrologically robust and that used the available uncertainty and covariance information, not just to determine the uncertainty associated with the harmonisation coefficients but also to determine the coefficients themselves.
Ordinary least squares (LSQ) is a commonly used approach to regression. However, it can only consider uncertainties in the derived quantity (simplistically, the y-axis), and it treats all observations as having independent errors. For the FIDUCEO harmonisation approach we wanted to respect the uncertainties associated with all measured values and the error correlation between match-ups. Aside from these philosophical requirements, in practice LSQ solutions have been found to cause biases that affect the long-term stability of the series, and therefore the ability to determine a climate trend, whereas more robust errors-in-variables (EIV) regression models, which can consider uncertainties associated with all variables including the ‘x-axis’, perform much better.
A simple set of simulations can illustrate the fundamental problem. We have taken a simple straight-line equation (Y = A + B×X) with fixed values A = 0.0, B = 1.0 in the range 0.0 < X < 1.0 and have generated X,Y pairs where noise of 0.05 has been added to both X and Y values. We have then fitted a straight line to the data where:
- Only the uncertainty associated with Y has been included in the fitting process (LSQ)
- Uncertainties associated with both X and Y have been included explicitly with Orthogonal Distance Regression (ODR)
The first two columns in the figure below show the distribution of the deviation of the fitted parameters (denoted as p[0] and p[1]) from the true values (A, B) as a function of the estimated uncertainty associated with p[0] and p[1] based on the solution’s covariance matrix. Also shown in red are the predicted normal distributions for a statistically consistent set of values relative to the truth. The figure makes it clear that LSQ is not capable of returning the correct values of A and B, whereas the ODR solution is completely consistent within the estimated uncertainty. The right-hand set of plots shows the deviation of the Y value estimated from the fits from the true Y value for an X of 2.0, which lies outside the fitted range. Again the LSQ fits are biased whereas the ODR values are not.
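The simulation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the FIDUCEO code: it assumes the "noise of 0.05" is a Gaussian standard deviation, it uses `numpy.polyfit` for the LSQ fit, and in place of a full ODR package it uses the closed-form orthogonal (Deming) regression, which is what orthogonal distance regression reduces to for a straight line when X and Y have equal uncertainties. The function names, number of trials and number of points per trial are hypothetical choices.

```python
import numpy as np

A_TRUE, B_TRUE = 0.0, 1.0
SIGMA = 0.05  # assumed: Gaussian standard deviation of the noise on both X and Y


def fit_lsq(x, y):
    """Ordinary least squares: treats x as error-free."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope


def fit_orthogonal(x, y):
    """Closed-form orthogonal (Deming) fit for equal uncertainties in x and y."""
    sxx = np.var(x)
    syy = np.var(y)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    slope = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope


def simulate(n_trials=500, n_points=200, seed=0):
    """Repeat the noisy straight-line experiment; return mean (A, B) per method."""
    rng = np.random.default_rng(seed)
    lsq = np.empty((n_trials, 2))
    orth = np.empty((n_trials, 2))
    for i in range(n_trials):
        x_true = rng.uniform(0.0, 1.0, n_points)
        y_true = A_TRUE + B_TRUE * x_true
        # Noise is added to BOTH variables, as in the text
        x_obs = x_true + rng.normal(0.0, SIGMA, n_points)
        y_obs = y_true + rng.normal(0.0, SIGMA, n_points)
        lsq[i] = fit_lsq(x_obs, y_obs)
        orth[i] = fit_orthogonal(x_obs, y_obs)
    return lsq.mean(axis=0), orth.mean(axis=0)


lsq_mean, orth_mean = simulate()
# Predicted Y at X = 2.0 (true value is A_TRUE + B_TRUE * 2.0 = 2.0)
lsq_y2 = lsq_mean[0] + lsq_mean[1] * 2.0
orth_y2 = orth_mean[0] + orth_mean[1] * 2.0
print("LSQ        mean (A, B):", lsq_mean, " Y(2.0):", lsq_y2)
print("Orthogonal mean (A, B):", orth_mean, " Y(2.0):", orth_y2)
```

With these settings the LSQ slope should come out systematically below 1 (the noise in X attenuates it), while the orthogonal fit should scatter around the true values. The bias does not shrink as more trials are added, which is why it matters for long-term stability.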
This is not, however, the end of the story, as existing Errors in Variables (EIV) implementations, such as ODR, still do not capture the error correlation structure between the data for the match-ups and so cannot provide an optimal solution. In the FIDUCEO project we developed novel methods for a rigorous, metrological solution to EIV regression which fully respects the match-up error correlation structure.
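To make "error correlation structure" concrete, a generic errors-in-variables objective can be written as below. This is the textbook generalised form, given here as an illustrative assumption rather than the specific FIDUCEO formulation. The symbols are hypothetical: x and y are the stacked observed values over all match-ups, x̂ are the estimated true values of x, a are the calibration parameters, f is the measurement model and V is the full error covariance matrix of the stacked observations.

```latex
\[
\min_{\hat{\mathbf{x}},\,\mathbf{a}}\;
\begin{pmatrix}
\mathbf{x}-\hat{\mathbf{x}}\\
\mathbf{y}-f(\hat{\mathbf{x}};\mathbf{a})
\end{pmatrix}^{\!\top}
\mathbf{V}^{-1}
\begin{pmatrix}
\mathbf{x}-\hat{\mathbf{x}}\\
\mathbf{y}-f(\hat{\mathbf{x}};\mathbf{a})
\end{pmatrix}
\]
```

When V is diagonal, every observation has an independent error and the expression reduces to a weighted orthogonal-distance fit. Non-zero off-diagonal entries of V encode errors shared between match-ups, which a diagonal weighting cannot represent.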
| | | ODR | EIV | f-ODR | f-EIV |
|---|---|---|---|---|---|
| Optimise | calibration parameters | yes | yes | yes | yes |
| | sensor state variables | yes | yes | – | – |
| Account for | independent random errors | yes | yes | yes | yes |
| | common random errors | yes | yes | yes | yes |
| | structured random errors | – | yes | – | yes |
The “Errors in Variables” (EIV) approach can account for structured errors, but it is slow to run. The “fast-EIV” (or “marginalised-EIV”) variant uses numerical techniques to speed up the analysis. In the FIDUCEO project we tried all of these methods to understand the differences between them.