Forecasting methods
Background
Various authors have proposed subdividing crop yield into three components: mean yield, multiannual trend and residual variation (e.g. Vossen, 1989; Dagnelie et al., 1983; Dennett et al., 1980; Odumodu and Griffits, 1980). The interacting effects of climate, soil, management, technology, etc. are assumed to determine the mean yield. Observed national, regional and subregional yields show a trend over time, mainly due to long-term economic and technological dynamics such as increased fertiliser application, improved crop management methods and new high-yielding varieties. The third component, the residual variation, is the variation among years (Dennett et al., 1980). It is this part that should be explained by weather, crop and remote sensing indicators.
According to Dennett et al. (1980) and Odumodu and Griffits (1980), the technological time trend should be removed from the crop yield time series, assuming that the residual variation is independent of that trend. This approach can be summarised as (Vossen, 1989):
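Y_{T,obs} = Y_{avg} + f(T) + e

where:
Y_{T,obs} = observed yield in year T
Y_{avg} = mean yield
f(T) = trend function of time
e = residual variation
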
Palm and Dagnelie (1993) fitted various time trend functions to national yield series (ton.ha^{-1}) of several crops for 9 EU member states. Regressions were fitted to the period prior to 1983 and a forecast for 1983 was made. This procedure was repeated for each successive year up to 1988, and the predictions were compared with the national yield values. Of the tested functions, a quadratic function of time performed best, although the differences with a simple linear trend function were small. In a next step, these authors removed the trend from the yield series using the quadratic function. The residuals for the period prior to 1983 were regressed against various meteorological parameters and a prediction for 1983 was made. Again, this procedure was repeated for each successive year up to 1988. This was done for 19 Departments in France. Comparing the predicted and official yield series showed that the applied meteorological variables did not improve the prediction accuracy.
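The successive one-year-ahead trend forecasts described above can be sketched as follows. This is a minimal illustration and not the procedure of Palm and Dagnelie (1993) itself: the function name `one_year_ahead_trend_errors` and the yield series are hypothetical, and the series is synthetic.

```python
import numpy as np

def one_year_ahead_trend_errors(years, yields, degree, first_target):
    """Forecast every year >= first_target from a polynomial trend fitted
    to all earlier years; return the errors (forecast minus observed)."""
    years = np.asarray(years, dtype=float)
    yields = np.asarray(yields, dtype=float)
    errors = []
    for i, target in enumerate(years):
        if target < first_target:
            continue
        past = years < target
        # Centre the time axis so the polynomial fit stays well conditioned.
        t0 = years[past].mean()
        coeffs = np.polyfit(years[past] - t0, yields[past], degree)
        forecast = np.polyval(coeffs, target - t0)
        errors.append(forecast - yields[i])
    return np.array(errors)

# Synthetic yield series (ton/ha), for illustration only.
years = np.arange(1970, 1989)
yields = 3.0 + 0.08 * (years - 1970) + 0.25 * np.sin(years)

for degree in (1, 2):  # 1 = linear trend, 2 = quadratic trend
    err = one_year_ahead_trend_errors(years, yields, degree, first_target=1983)
    rmse = np.sqrt(np.mean(err ** 2))
    print(f"degree {degree}: one-year-ahead RMSE = {rmse:.3f} ton/ha")
```

Each forecast uses only data from earlier years, so the errors measure genuine out-of-sample performance of the trend function.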
Swanson and Nyankori (1979) for corn and soybean production in the USA, Sakamoto (1978) for wheat production in South Australia, and Agrawal and Jain (1982) for rice yields in the Raipur District in India considered the technological time trend to be dependent on the residual variation. According to Winter and Musick (1993), Hough (1990) and Smith (1975), weather affects farm management practices such as planted area, timing of field operations, application of inputs, etc. Hence, the time trend should be analysed simultaneously with the explanatory variables. This approach can be summarised as (Vossen, 1989):
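Y_{T,obs} = b_{0} + f(T) + f(weather) + e

where:
Y_{T,obs} = observed yield in year T
b_{0} = constant
f(T) = trend function of time
f(weather) = function of the weather variables
e = residual variation
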
Swanson and Nyankori (1979) showed that the time trend was underestimated when weather data were not analysed simultaneously with the time trend. Similar results were found for millet in Botswana (Vossen, 1989).
The previous equation does not account for the interaction between crop growth and weather variability. Root characteristics and soil physical properties are not accounted for either. Therefore Vossen (1990b, 1992) proposed using crop growth simulation results to describe the year-to-year yield variation. In a crop growth simulation model, weather and soil characteristics are summarised, and crop characteristics, including yield, form the output; i.e. the simulation results quantitatively represent the influence of weather variables on crop growth. The yield can be written as:
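Y_{T,obs} = b_{0} + f(T) + f(simulation) + e

where:
Y_{T,obs} = observed yield in year T
b_{0} = constant
f(T) = trend function of time
f(simulation) = function of the crop growth simulation results
e = residual variation
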
Prediction model
Official statistics of regional mean yields are predicted by the CGMS using one of the following simulated predictors (see Crop Simulation):
 Potential above ground biomass (ton.ha^{-1} dry weight)
 Water limited above ground biomass (ton.ha^{-1} dry weight)
 Potential storage organs biomass (ton.ha^{-1} dry weight)
 Water limited storage organs biomass (ton.ha^{-1} dry weight)
Originally, the intention was to predict yields using only the water limited weight of storage organs in the prediction model. The other three predictors were added later. Water limited yield, for instance, is inappropriate for a region with a lot of irrigation. Furthermore, drought stress can be strongly reduced in case of groundwater influence, a factor that is not included in the CGMS. The simulated biomass indicators were added because they are more robust and less sensitive to modelling errors in the distribution of assimilates. Moreover, they allow yield prediction during the growing season, when grain filling has not yet started or grains are still very small (de Koning et al., 1993).
Although by default only the four crop indicators listed above are taken into account, regression models can be constructed from any combination of indicators available in the Weather Monitoring, Crop Simulation and Remote Sensing modules. These models can be constructed with SPSS or with the user interface of the CGMS statistical tool.
The statistical subsystem of the CGMS uses a combination of a linear time trend and crop growth simulation results, as proposed by Vossen (1990b, 1992). This prediction model can be described as:
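Y_{T} = b_{0} + b_{1}T + b_{2}S_{T}

where:
Y_{T} = official mean yield in year T (ton.ha^{-1})
T = year
S_{T} = simulated predictor in year T (one of the indicators listed above)
b_{0}, b_{1}, b_{2} = regression constants

Suboptimal production circumstances such as drought, low temperatures, etc. are allowed for by the constant b_{2}, which should lie between 0 and 1.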
Per region, the regression coefficients are established for a moving window of at least 9 years and subsequently used to predict the yield of the 10th year ('one-year-ahead'). The predictor used to forecast the final yield is selected as follows:
 Each candidate predictor is fitted to the data currently available for this region.
 Candidates with a negative estimate of b_{2} are rejected because of the nature of the process.
 From the remaining ones, that with the lowest jackknife mean square error is selected.
Jackknife errors 

Jackknife errors are calculated by pretending that an observation is absent and using the predictor to assess its value. This reveals the error made in predicting the observation that was kept out of sight. Obviously, jackknife errors are not entirely relevant in the present situation, where the aim is to predict the future rather than to reconstruct the past. For direct application it is more relevant to investigate the one-year-ahead prediction. The jackknife method is nevertheless used because its error-size estimates are less variable, being based on a larger number of predictions. With the same number of observations 'n', the jackknife method yields 'n' error estimates, while the 'one-year-ahead' prediction yields only 'n - y' error estimates, where 'y' is the number of years on which the prediction is based. More detailed descriptions are given by de Koning et al. (1993) and Jansen (1995).
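The selection rule and the jackknife error can be sketched together as follows. This is a minimal illustration, not the CGMS code: the function names and the candidate names `wl_storage`, `pot_biomass` and `flipped` are hypothetical, and the series are synthetic.

```python
import numpy as np

def _fit(T, S, Y):
    """Least-squares constants (b0, b1, b2) of Y = b0 + b1*T + b2*S."""
    X = np.column_stack([np.ones(len(T)), T, S])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return b

def jackknife_mse(T, S, Y):
    """Leave each year out in turn, refit, and predict the omitted year."""
    n = len(Y)
    squared_errors = []
    for i in range(n):
        keep = np.arange(n) != i
        b = _fit(T[keep], S[keep], Y[keep])
        prediction = b[0] + b[1] * T[i] + b[2] * S[i]
        squared_errors.append((prediction - Y[i]) ** 2)
    return float(np.mean(squared_errors))

def select_predictor(T, Y, candidates):
    """Reject candidates with negative b2; return the remaining candidate
    with the lowest jackknife mean square error as (name, mse)."""
    T = np.asarray(T, dtype=float)
    Y = np.asarray(Y, dtype=float)
    best_name, best_mse = None, np.inf
    for name, S in candidates.items():
        S = np.asarray(S, dtype=float)
        if _fit(T, S, Y)[2] < 0:
            continue  # negative b2 contradicts the nature of the process
        mse = jackknife_mse(T, S, Y)
        if mse < best_mse:
            best_name, best_mse = name, mse
    return best_name, best_mse

# Synthetic example: official yields built from the first candidate.
T = np.arange(9.0)
wl_storage = np.array([6.1, 7.4, 5.2, 6.8, 7.9, 5.9, 6.5, 7.2, 6.0])
pot_biomass = np.array([12.0, 12.5, 11.8, 13.1, 12.2, 12.9, 12.4, 13.0, 12.1])
Y = 2.0 + 0.05 * T + 0.6 * wl_storage
candidates = {"wl_storage": wl_storage, "pot_biomass": pot_biomass,
              "flipped": -wl_storage}
print(select_predictor(T, Y, candidates))
```

With n observations the jackknife loop produces n squared errors, which is why its error estimate is more stable than one based on the few available one-year-ahead forecasts.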
A quadratic trend function is also considered in the CGMS. However, based on results of Palm and Dagnelie (1993) and de Koning et al. (1993), it was concluded that a linear trend sufficiently describes the increasing official yields. A smooth trend of any type over a large number of years assumes a continuity which might be unrealistic (de Koning et al., 1993; Vossen, 1992; Vossen, 1990a). According to Vossen and Rijks (1995) the predictor should only be based on data from the recent past. The length of the series should nevertheless be long enough to give a sufficient number of degrees of freedom in the regression analysis. Gradual shift in the time trend is allowed for by the shortness of the time series used to derive the predictor.
Required input data are stored in the tables:
 DATA_FOR_YIELD_FORECAST (GUI version)
 CROP_YIELD (Batch version)
 EUROSTAT
 NUTS
 STAT_CROP
The statistics have a wider range of crops than the ones considered by the Crop Simulation. Therefore yields of some of the 'statistical crops' are forecasted using the same 'CGMS crop'. This relation is stored in table STAT_CROP.
To be able to run the forecast in batch mode, all model parameters are stored in advance in tables:
 RUN
 MODEL_EXCL_YEARS
 MODEL_INCL_INDICATORS
 MODEL_REGR_INDICATIFS
 MODEL_SCEN_SIM_YEARS
 MODEL_SCEN_INDICATIFS
Every ten days all stored models are run and the results are written to the tables:
 FORECASTED_NUTS_YIELD (GUI version)
 FORECASTED_NUTS_YIELD_HIS (Batch version)
Before the start of each growing season, yield forecasts are produced based on the long-term average, corrected for a technological trend. The MARS analyst can change the length of the time series. This redefines the trend function and results in different CGMS level 3 forecasts.
Trend analysis
When the accuracy for a certain combination of country and crop is deemed insufficient, the MARS analyst starts to redefine trend periods and functions using Excel, SPSS or the user interface of the CGMS statistical tool.
First, trends for a longer period (1975 until current year) are determined if yield statistics for such a period are available. Next, trends for more recent periods are studied. For Eastern Europe the period after 1990 is used (to exclude strong changes caused by political changes around 1990). For countries within the European Union the period after 1992 is important because in 1992 the Common Agricultural Policy went through important changes that affected yield and planted areas.
Besides changing the trend period, different trend functions are studied. Yield statistics of each country are taken directly from the CRONOS database, which is updated each month. Linear, quadratic and other types of trend are studied. MARS analysts also study the minimum and maximum trend evolution by separating the data set into two groups representing the 50% highest and 50% lowest values.
Scenario analysis in SPSS
To deal with the residual uncertainty caused by the unknown evolution of the season between the moment the forecast is issued and the moment the crop is harvested, agrometeorological scenarios can be produced and analysed. The scenario analysis consists of finding the most similar agrometeorological years based on the time series of parameters simulated by the CGMS. The analysis is based on PCA, Factor Analysis and Cluster Analysis (Hair et al., 1998). By default, the crop indicators of the CGMS for all available years are used as input (see Crop Simulation). Note that the similarities are established on the time series of simulated crop parameters rather than on the meteorological data themselves: years that are similar in climatology are not necessarily similar in crop response, because small changes in the sequence of meteorological events can have a major effect on crop behaviour.
The PCA gives a new combination of independent variables (factors). The first factors, together explaining up to 90% of the variability, are selected, and the axes of pairs of factors are interpreted in terms of the original variables. The units (the yearly observations) are then plotted on the new factors to characterise the years (for instance as a dry and hot season…).
This is repeated for each country and crop (recall that the original variables are the crop growth parameters simulated by the CGMS). The analyst then runs a cluster analysis on the new factors (normally a hierarchical cluster analysis) to obtain groups of years that are homogeneous with respect to these factors. Similarity or dissimilarity matrices help to rank the years by similarity. Once the similarity scores and their ranking are available, the forecast is obtained as a weighted average of the corresponding yields (detrended where a trend is present), with weights given by the similarity indexes. Simple statistics of the cluster of similar years are also used: the maximum and minimum yields within the group serve as the optimistic and pessimistic yield scenarios.
The routine used in SPSS is the following:
FACTOR
  /VARIABLES ds sm wlai plai wb pb twc twr
  /MISSING LISTWISE
  /ANALYSIS ds sm wlai plai wb pb twc twr
  /PRINT UNIVARIATE INITIAL CORRELATION KMO EXTRACTION
  /PLOT EIGEN ROTATION
  /CRITERIA FACTORS(2) ITERATE(25)
  /EXTRACTION PC
  /ROTATION NOROTATE
  /SAVE REG(ALL)
  /METHOD=CORRELATION.
GRAPH
  /SCATTERPLOT(BIVAR)=fac1_1 WITH fac2_1 BY year (NAME)
  /MISSING=LISTWISE.
CLUSTER fac1_1 fac2_1
  /METHOD WARD
  /MEASURE=SEUCLID
  /ID=year
  /PRINT NONE
  /PRINT DISTANCE
  /PLOT DENDROGRAM.
In this example ds, sm, wlai, plai, wb, pb, twc and twr stand respectively for development stage, soil moisture, water limited leaf area index, potential leaf area index, water limited biomass, potential biomass, total water consumption and total water requirements. These are the parameters simulated by the CGMS.
With the dekade, crop and country/region/grid fixed, the initial data are the CGMS simulations per year (the years are the units).
Step 1) FACTOR In this example the main variables are extracted for a given simulated crop and country, and a FACTOR analysis reduces them to a few variables explaining about 90% of the variability (2 in the example). As an alternative to fixing a dekade, the procedure can be run on several dekades, in which case the number of variables can increase substantially.
Step 2) GRAPH The original units (years) are then plotted on the new axes. This characterises the current season in terms of its impact on crop growth, i.e. wet and cold; wet and hot; dry and cold; dry and hot.
Step 3) CLUSTER The third step identifies the similar years, since the factor plot alone may not be sufficient to find them. The cluster algorithm is used for this purpose. The similar years are determined from the coefficients of dissimilarity produced in the distance analysis. These coefficients are used in two ways: first, to detect the ten most similar years (or those whose dissimilarity is below a defined threshold); second, as weights to define a prediction.
Step 4) The fourth step is the prediction derived from the similar years (not in the routine above). The (year, yield) pairs belonging to the group of similar years determine a range of yields and an average (the minimum and maximum can be used as scenario min and scenario max, with their explanation given by the characterisation from the factor analysis). The prediction is then obtained either as the plain average or (better) as a weighted average in which the weights come from the dissimilarity coefficients. If a trend is present, all steps are in fact run on deviations from the trend (so the choice of trend model affects all of the results).
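Steps 1 to 4 can be sketched in Python as follows. This is a simplified illustration, not a translation of the SPSS routine: it ranks years directly by squared Euclidean distance in the space of the first components instead of running a Ward cluster analysis, and it assumes inverse-distance weights, which is only one possible way of deriving weights from the dissimilarity coefficients. The function name `similar_year_forecast` and all data are hypothetical.

```python
import numpy as np

def similar_year_forecast(indicators, years, past_yields, target_year,
                          n_components=2, n_similar=3, eps=1e-9):
    """indicators: one row per year, one column per simulated indicator.
    past_yields: dict year -> (detrended) yield of the years already known."""
    X = np.asarray(indicators, dtype=float)
    n = X.shape[0]
    # Standardise the variables, i.e. PCA on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    # Unit-variance component scores of the first components.
    scores = U[:, :n_components] * np.sqrt(n - 1)
    explained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())

    years = list(years)
    k = years.index(target_year)
    # Squared Euclidean dissimilarity of every year to the target year.
    dist = ((scores - scores[k]) ** 2).sum(axis=1)
    ranked = [int(i) for i in np.argsort(dist)
              if i != k and years[i] in past_yields]
    nearest = ranked[:n_similar]

    weights = 1.0 / (dist[nearest] + eps)  # assumption: inverse-distance weights
    values = np.array([past_yields[years[i]] for i in nearest])
    return {
        "similar_years": [years[i] for i in nearest],
        "forecast": float(np.average(values, weights=weights)),
        "low": float(values.min()),    # pessimistic scenario
        "high": float(values.max()),   # optimistic scenario
        "explained": explained,
    }

# Hypothetical indicators (e.g. ds, sm, wb) for six years; 2000 is the target.
yrs = [1995, 1996, 1997, 1998, 1999, 2000]
ind = [[1.0, 10.0, 5.0], [2.0, 12.0, 4.0], [3.0, 9.0, 6.0],
       [4.0, 15.0, 3.0], [5.0, 11.0, 7.0], [2.0, 12.0, 4.0]]
past = {1995: 5.0, 1996: 6.0, 1997: 5.5, 1998: 4.0, 1999: 6.5}
print(similar_year_forecast(ind, yrs, past, 2000))
```

If the yields have a trend, the values in `past_yields` should be deviations from the trend, and the trend value of the target year has to be added back to the result.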
Example Germany soft wheat scenarios in 2003 

This analysis was made during the second dekade of May, using as input all dekades of soil moisture and development stage values (years analysed: 1975 to 2003).
The scree plot on the left shows the eigenvalues of the factor analysis run on 20 variables. The first two corresponding axes (the most explanatory) are given above on the right. Note that the first quadrant is correlated with development stage (all development stage variables are concentrated here), a direct expression of the influence of temperature on crops. The y axis is explained by the soil moisture in April. Going counterclockwise, the northeast direction in the first quadrant represents the hottest and most humid years (in terms of effect on crops), the northwest direction in the second quadrant the cold and humid years, the southwest direction in the third quadrant the cold and dry years, and the southeast direction in the fourth quadrant the hot and dry years. The graph below shows the position of the years on the first two new axes. In this example 2003 was, at that time, not far from the origin and appeared as a slightly dry and cold year in May. However, the position of 2003 in the new system of coordinates was opposite to that of 2002 (a year characterised by a very high level of precipitation).
Example Spain soft wheat scenarios in 2000 

This analysis was made in March using as input all dekades of soil moisture and development stage values (years analysed: 1975 to 2000), where all the variables were analysed in the same dekade. The variables are development stage (DS), soil moisture (SM), potential biomass (PB), potential storage organs (PS), water limited storage organs (WS), potential leaf area index (PLAI) and water limited leaf area index (WLAI). The difference between potential and water limited indicators is explained in section Crop Simulation.
The factor analysis shows that the first two components explain almost 90% of the variability.
[Table: Total variance explained. Extraction method: Principal Component Analysis.]
The corresponding scree plot and the contribution of each variable to the final variability are displayed in the following scree plot and component matrix.
[Figure: Scree plot.]
[Table: Component matrix. Extraction method: Principal Component Analysis; 2 components extracted.]
The plots below show the variables on the first two axes and the units (years). The year 2000 was placed among the normal years at that time. Note the years in the fourth quadrant of the last chart, which can be read as the area of the dry and hot years (years of drought), already well characterised in March.
[Figure: Component plot.]
This technique helps to understand how the yield prediction could still change before harvest. In theory, the further the growing season advances, the fewer similar years remain and the lower the uncertainty. Further studies are under way to validate the approach.