Interpolation of observed weather

From Agri4castWiki
Revision as of 21:09, 28 February 2021 by Hendrik2 (talk | contribs) (Repair missing weather - workflow)
Jump to: navigation, search


General description

Interpolation from weather stations to regular climate grid.

The daily meteorological data is interpolated towards the centre of a regular climatic grid. The grid definition is stored in object GRIDS. The data of the climatic grid is stored in object WEATHER_OBS_GRID. In the case of the European window (EUR), the regular grid projected in coordinate system ETRS_1989_LAEA.


The grid cell size is based on the assumption that within a region of 25 by 25 kilometers the meteorological data are homogeneous. It is expected that temperature, sunshine, humidity, and wind speed gradually change over distances of 50 to 150 kilometers. Local weather variations may be larger than changes over such distances, most obviously for solar radiation caused by differences in slope aspect, or presence of persistent clouds. Temperature conditions near the surface follow variations in solar radiation but there are also temperature gradients due to cold air drainage or differences in surface cover and in topography. Sometimes large differences occur over a short distance (10-40 km) when there is sharp transition between two different air masses, e.g. when a front passes or when a region under a persistent cloud cover exists adjacent to a region under a cloudless sky (van Diepen, 1998). Beek (1991) reports that if daily means are used, most meteorological variables do not differ too much over distances in the range of 50-150 km.

More complicated is the spatial variation in precipitation, usually in the form of rainfall. Rain may fall from a local cumulonimbus in showers with high density (convective rains), or in a front passage with low density over large areas (depression rains). Geographic patterns of rainfall are influenced by the geometry of land and sea surfaces, and by general circulation patterns. Western facing slopes of hill and mountain ranges receive more than average rainfall, east facing slopes less. For that, the spatial distribution patterns of rainfall are thus irregular (Beek, 1991; van Diepen, 1998).

The methodology for the spatial interpolation of the data of the existing network of meteorological stations towards the climatic grid cells centre is based on different approaches:

  • CGMS: classical approach taking the weighted average of preselected set of stations most similar to the grid cell centre. This is applied for all relevant elements. This approach is based on the studies of Beek, 1991 and van der Voet et al., 1994. It is described by van der Goot, 1998a. This method was chosen because its simple approach made it easy to automate while the accuracy was sufficient to serve as input to the crop growth model. The interpolation is executed in two steps: first the selection of suitable weather stations to determine representative meteorological conditions for a specific climatic grid cell. Second, a simple average is calculated for most of the meteorological parameters, with a correction for the altitude difference between the station and grid cell centre in case of temperature and vapor pressure. As an exception rainfall data are taken directly from the most suitable station.
  • Regression kriging: a geo-statistical methods based on a regression model, calibrated on training data, and simple kriging of the regression residuals. This is applied for precipitation only.

The final interpolated product is composed of precipitation interpolated via regression kriging and all other elements interpolated via the CGMS approach. This final choice followed from an internal intercomparison study in which regression kriging outperformed CGMS for precipitation but not for maximum temperature. The latter was assumed representative for the other elements.

It must be stressed that the consequence of the selected interpolation is that values obtained for each climatic grid cell represents an ‘average’ daily condition. They do not necessarily represent meteorological conditions that could be measured at the climatic grid cell centre. For instance, the altitude used is not the altitude that can be measured at the climatic grid cell centre, but rather a value that represents the mean altitude of the agricultural activity in the concerned cell. In addition different elements like rainfall and temperature can be based on different stations as they belong to different interpolation classes.

CGMS: general selection of weather stations (temporal availability)

Not all weather stations broadcast a complete set of data and not all stations broadcast continuously. To increase interpolation reliability, and to reduce the computational requirements, the CGMS performs checks on data availability of weather stations. First a classification is made with respect to the data type that stations can deliver. Five classes are distinguished:

  • PRECIPITATION
  • SNOWDEPTH
  • TEMPERATURE
  • REST (radiation, evapotranspiration, vapour pressure and wind speed)
  • HUMIDITY

Secondly, only weather stations within a radius of 250 kilometers around the climatic grid cell centre are candidates for interpolation. The radius is defined in the table INTERPOLATION_PARAMETERS (parameter IPOL_MAXDIST).

The selection of available stations can be done differently for the current year (current year mode) and the historic years (historic mode). Main difference is that in the current year mode an interpolation timeseries can be based on a varying station configuration while the historic mode has a fixed station set within a year. In the current services the current year mode is applied for all historic years! Main reason is to make better use of the available data as in the historic mode not all observations are used (see below for details). Another advantage is that no climatology information is needed to fill gaps in time-series for historic years anymore. A possible drawback is the changing station configuration within a historic year, which may affect temporal patterns in the interpolated weather of grid cells.

Current year

The current year is defined in the object INTERPOLATION_PARAMETERS. For each climatic grid cell, for each day, for each of the above five classes all stations are selected that are within a radius of 250 km around the grid cell. Thus, in the current year, each combination of grid cell, day and class can have its own unique list of stations. With regard to relative humidity the availability check of a station also include a check on the complete day: a station is available if the temperature and humidity observations of all 3-hourly time steps within a day are available. In case no stations are found for a combination of climatic grid cell, day and class, all stations are selected that are within a radius of 250 km around the climatic grid cell and have data for this specific day and class in object WEATHER_OBS_STATION_TA ánd are considered reliable. Reliability is defined per station in object STATIONS.


Historic years

Note that this mode is not being used in the operational service anymore but it is described for reason of completness. For interpolation of historic station weather each climatic grid cell uses the same set of stations for a complete year. For most recent years around 3500-4000 stations have a sufficient temporal coverage (enough observations within one year). Only these are used in the interpolation procedure. The selection procedure determines for each weather station if the data availability in object WEATHER_OBS_STATION and object WEATHER_OBS_STATION_CALCULATED for a class exceeds a certain threshold for each complete year separately. The threshold value can be configured per station and per class. Currently the classes PRECIPITATION, TEMPERATURE, REST and HUMIDITY have a threshold of 79.9% for all stations (defined in object STATIONS). It means i.e. if the station data in a particular class for a historic year is for more than 79.9% complete, the station is a candidate for the interpolation of the data in the concerned class. In case of class SNOWDEPTH the threshold is equal to 0% meaning that any station can be used for interpolation, as long as the station has at least one snow depth measurement for the selected historic year. Reason is the low temporal availability of snow observations during the year.

One additional check is done: the station in the selected class must have climatology data for all 366 calendars days stored in table WEATHER_OBS_STATION_LTA. After determination of availability, CGMS writes the results into the table WEATHER_OBS_STATION_AVAIL. Only these stations are used in the interpolation for historic years.

With regard to relative humidity the availability check of a station also include a check on the complete day. A station is available if the temperature and humidity observations of all 3-hourly time steps within a day are available (complete day) followed by the check of more than 79.9% of all days in that year have complete days.

CGMS: final selection of weather stations

To select the final station(s) to be used for a climatic grid cell on a specific day and for a specific class, CGMS applies a selection procedure that relies on the similarity of the 'source' station and the 'targetted' climatic grid cell centre. For the classes PRECIPITATION, SNOWDEPTH and HUMIDITY only 1 station is selected. For these classes the similarity score is expressed as the result of an algorithm that takes the following criteria into account:

  1. distance between the station and the grid cell centre
  2. difference in altitude between station and grid cell
  3. difference in distance to the coast between the station and the grid cell centre
  4. location of the station and the grid cell centre with regard to uniform weather regions

The higher the score, the lower the similarity between the station and the grid cell centre.



For the classes TEMPERATURE and REST, the interpolation to a climatic grid cell can use data from 1 up to 4 stations. To determine the most suitable set of stations the CGMS calculates a set score (combination score) in a similar way as the single station score (see similarity score) for each class TEMPERATURE and REST (vapour pressure, windspeed, evapotranspiration, radiation). It is based on the mean scores of a set of stations. Only the stations that satisfy the temporal availability check are included in the selection.


Theoretically, the set score could be calculated for all possible combinations of one up to four weather stations, taken from all available weather stations with sufficient observations. However obviously this would lead to many unnecessary calculations. Therefore, the CGMS determines the set score only for all combinations of one up to 4 weather stations, taken from the seven weather stations most similar to the climatic grid cell centre. This results in 98 sets for which the set score has to be calculated.

CGMS: interpolation of rainfall

Interpolated rainfall should be realistic in terms of number of rainy days, amount of rainfall and temporal distribution. The effects on the soil water balance and consequently on crop growth simulation of various small daily showers is different than for example one large rainfall event per week. When rainfall data from several surrounding stations are averaged, the rainfall peaks are leveled off and the number of rainy days increases. Therefore the CGMS takes for each climatic grid cell the rainfall of the most similar station i.e. the weather station with the lowest score for the concerned grid cell that has data. Only stations that satisfy the temporal availability check are included in the selection.

Stations can have missing data for a specific day. In case of historic years a maximum of 20.1% missing values is allowed (see temporal availability check). If the most similar station has no data for a certain day the second most similar station is checked etc. When no data are available for the 7 most similar stations, the long term average value of the most similar station is taken from object WEATHER_OBS_STATION_TA. Note that the current service does not make use of the CGMS historic mode.

In case of the current year all the selected stations have data for the day selected (see temporal availability check). Still it is possible that within a radius of 250 km of a grid cell no stations are found that have observations for class PRECIPITATION. To secure data availability for the 25 km grid cell the long term average value of the object WEATHER_OBS_STATION_TA is selected. In this case only stations are used that are identified as reliable stations in object STATIONS.


CGMS: interpolation of snow depth

Snow depth is interpolated in a similar manner as rainfall. For each climatic grid cell the snow depth of the most similar station is taken i.e. the weather station with the lowest score for the concerned grid cell that has data. Only stations that satisfy the temporal availability check are included in the selection. However for snow depth the threshold is set to 0% (see temporal availability) meaning that all stations with one or more observations in a historic year will be selected. Missing data of a selected station for historic years is treated in the following way. If the most similar station has no data for a certain day the second most similar station is checked etc. When no data are available for the 7 most similar stations, no snow depth data (NULL value) is returned for the selected grid cell. Note that the current service does not make use of the CGMS historic mode.

In case of the current year all the selected stations have data for the day selected (see temporal availability check). Still it is possible that within a radius of 250 km of a grid cell no stations are found that have observations for class SNOWDEPTH. Unlike rainfall such grid cells will not receive a snow depth value based on climatology.


CGMS: interpolation of other weather data

Once the best set of weather stations for each climatic grid cell has been determined, as expressed by the minimum set score, data of these stations are weighted averaged to obtain the climatic grid cell value. The weights are derived from the similarity scores between selected stations and target grid cell. Temperature and vapor pressure are corrected for the difference in altitude between the selected weather station and the climatic grid cell centre. The correction factors used for the temperature and vapor pressure are respectively -0.006 (°C.m-1) and -2.5% per 100 meter increase (van der Voet et al.,1994).

Stations can have missing data. In case of historic years a maximum of 20.1% missing values is allowed (see temporal availability check). If for a certain day data is missing simply the long term average value of that day is taken from object WEATHER_OBS_STATION_TA. Note that the current service does not make use of the CGMS historic mode.

For the current year all the selected stations have data for the selected day (see temporal availability check). Still it is possible that within a radius of 250 km of a grid cell no stations are found that have observations for class TEMPERATURE or REST. To secure data availability for the 25 km grid cell the long term average value of the object WEATHER_OBS_STATION_TA is selected. In this case only stations are used that are identified as reliable stations in object STATIONS. For determining reliable stations see 'determining reliable stations'.


CGMS: interpolation of humidity

Relative humidity is interpolated in a similar manner as snow depth. For each climatic grid cell the relative humidity of the most similar station is taken i.e. the weather station with the lowest score for the concerned grid cell that has data. Only stations that satisfy the temporal availability check are included in the selection. Missing data of a selected station for historic years is treated in the following way. If the most similar station has no data for a certain day the second most similar station is checked etc. When no data are available for the 7 most similar stations, no relative humidity data (NULL value) is returned for the selected grid cell. Note that the current service does not make use of the CGMS historic mode.

In case of the current year all the selected stations have data for the day selected (see temporal availability check). Still it is possible that within a radius of 250 km of a grid cell no stations are found that have observations for class HUMIDITY. Unlike rainfall (and similar to snow depth) such grid cells will not receive a relative humidity value based on climatology.

Because relative humidity cannot be corrected for altitude differences between the station and the target climatic grid cell, relative humidity is converted to actual vapour pressure before interpolation. For each climatic grid cell the most similar station is selected. Next, for each time step (06, 09, 12, 15, 18 UTC) for a selected station:

  • the saturated vapour pressure is calculated using the temperature and formula: Es = 6.11 * exp((17.4*T)/(T+239)).
  • the actual vapour pressure is calculated by multiplying saturated vapour pressure with the relative humidity divided by 100.

Next:

  • the actual vapour pressure of the similar station is corrected for altitude differences between the similar station and the centre of the climatic grid cell applying a correction factor of vapour pressure (a decrease of 2.5% for an increase of 100 m). Note this correction is also applied in the interpolation of actual vapour pressure in CGMS (see Interpolation of other weather data).
  • the saturated vapour pressure of the climatic grid cell is calculated by using the temperature of the similar station with a correction for altitude (a decrease in temperature of 0.6 degree Celsius for an increase of 100 m).
  • finally, the relative humidity is calculated by dividing the actual vapour by the saturated vapour pressure of the grid cell and multiply it with 100.
  • after applying the altitude corrections the value of relative humidity is checked for the physical plausible limits: 0 and 100%.


Regression kriging: interpolation of precipitation

Geo-statistical methods approach the spatial interpolation problem by first defining a statistical model of the variable of interest in this case precipitation. This model usually includes a trend and a stochastic residual. The trend may be taken as a linear or non-linear combination of covariates (a.k.a. explanatory variables), such as elevation, terrain, land cover and climate. The stochastic residual has zero mean and constant variance, but it may be spatially correlated, as characterised by a semi-variogram. Given the model, spatial interpolation is achieved by regression kriging.

This boils down to application of the regression model as calibrated on training data and simple kriging of the regression residuals. The kriging step will only add value if the residual is spatially correlated. Kriging also takes a weighted average, where the weights are derived from the semi-variogram and spatial configuration of the data points.

The following steps were carried out to set-up and run the interpolation:

  • Collect covariates: elevation, vegetation indices of MODIS, location, precipitation of CRU and daily precipitation sums from ECMWF models: ERA-Interim (ERA) and the operational HRES model (HIS);
  • Model the trend, i.e., find a relation between the response variable and (a subset of) the covariates;
  • Use this model to predict the trend at all observation locations (meteorological stations);
  • Calculate the residuals, i.e., the differences between the observed response variable and the trend;
  • Estimate the empirical semi-variogram of the residuals by means of the method of moments (e.g. Goovaerts, 1997));
  • Model the empirical semi-variogram of the residuals;
  • Apply simple kriging to predict residuals at all prediction locations and add these to the trend at these locations;

To model the trend the Cubist model was used (Quinlan, 1992,Quinlan, 1993, http://rulequest.com/cubist-info.html). A Cubist model is a rule based regression tree where the terminal leaves contain simple linear regression models. Given a set of covariates and the response variable, a Cubist model uses the covariates to construct a set of rules and regression equations that together predict the response. This can be accomplished in various ways. The default settings of the Cubist package in R (Kuhn & Quinlan, 2018) were used. The Cubist models is particularly useful for modelling precipitation as it is able to discriminate between events with different amounts of precipitation, varying from zero in dry periods, to locally high precipitation sums during summer showers.

To fit Cubist-models, the data was randomly split into a training set and a test set. The training set was twice the size of the test set. All months were equally represented in the training set and test set. Highly correlated variables were not selected in the same model. This step revealed the relative importance of each covariate. Based on these results, the complexity of the model was reduced by retaining only the most important covariates while still maintaining predictive power as confirmed by the test set: precipitation sums from ECMWF models (ERA-Interim (ERA) and the operational HRES model (HIS)).

The residuals, i.e., the differences between observed and Cubist-predicted precipitation amounts, are subsequently modelled by simple kriging (e.g., Goovaerts, 1997). To reduce computation time, only stations within a search radius of 100 km from the prediction point are considered for each prediction. The final prediction is the sum of the trend as predicted by the Cubist model and the simple kriging predictions of the residuals. For locations lacking station data within a search radius of 100 km the final prediction is based on the trend as predicted by the Cubist model.

One of the challenges is to automatically estimate the sample semi-variogram and fit a permissible model to it as this is normally an interactive process. The flexible Matérn model, the Spherical model, and the Exponential model to fit the empirical semi-variogram by means of the gstat-package (Pebesma, 2004) are used. To increase the probability of a successful fit, many sets of initial parameter values are taken and afterwards the fit with the smallest residual error is selected. Note that for the whole region each day one semi-variogram is estimated and fitted.

It is well known that (regression) kriging smooths the predictions. Hence, in addition to the regression kriging variants above, a post-processing method was applied mentioned in Journel & Xu, 1994 p.324 and Goovaerts, 1998, p.520 to reduce smoothing and bias:

xp* = Fd-1(Fp(xp))

where Fd-1 is the inverse of the cumulative distribution function of the original data at a specific date and Fp the cumulative distribution function of the predictions xp at the same date. Both cumulative distributions are derived for each day over the whole of Europe. xp* are the post-processed (corrected) predictions.

The interpolation is done in R using the packages Cubist, RJDBC and MARSOPRK.

The regression kriging for precipitation is fully automated. Input are read from the objects:

  • STATIONS (locations)
  • WEATHER_OBS_STATION (observations)
  • GRIDS (locations)
  • WEATHER_ERA_GRID (co-variate)
  • WEATHER_HIS_GRID (co-variate)

The interpolated precipitation is written in object WEATHER_OBS_GRID_RK for both the regular and the unsmoothed version.

Repair missing weather - workflow

xxx


Next the ORACLE package Workflow is called, 00workflow_daily.bat, for the specific day (not for CGMS14CHN). It does three tasks: • Filtering the old style interpolated weather removing all the LTA gap filled values using InsertPureObs procedure from the workflow package • Creation of “observed” data generated from the ECMWF-HIS covariate using CreateObsFromECMWFCGFS procedure from the workflow package

• Merging of WEATHER_OBS_GRID_CGMS_PURE, WEATHER_OBS_GRID_MD and WEATHER_OBS_GRID_RK using the GRIDWEATHERMODEL and GRIDWEATHERCGMS procedures from the workflow package