Difference between revisions of "Forecasting methods"

From Agri4castWiki

Revision as of 13:51, 19 June 2012


Various authors have proposed subdividing crop yield into three components: mean yield, multi-annual trend and residual variation (e.g. Vossen, 1989; Dagnelie et al., 1983; Dennet et al., 1980; Odumodu and Griffits, 1980). It is assumed that the interacting effects of climate, soil, management, technology, etc. determine the mean yield. Observed national, regional and sub-regional yields show a trend over time. The trend is mainly due to long-term economic and technological dynamics such as increased fertiliser application, improved crop management methods, new high-yielding varieties, etc. The third component, the residual variation, is considered to be the variation among years (Dennet et al., 1980). It is exactly this part that should be explained by weather, crop and remote sensing indicators.

According to Dennet et al. (1980) and Odumodu and Griffits (1980), the technological time trend should be removed from the crop yield time series, assuming that the residual variation is independent of that trend. This approach can be summarised as (Vossen, 1989):
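The two-stage idea (remove the trend first, then explain what is left) can be sketched in Python as follows. This is only an illustration under stated assumptions: the yield figures are invented, the function names are hypothetical, and a linear trend is assumed although other trend shapes could be used.

```python
def fit_linear_trend(years, yields):
    """Ordinary least squares fit of yield = intercept + slope * year."""
    n = len(years)
    mean_t = sum(years) / n
    mean_y = sum(yields) / n
    sxx = sum((t - mean_t) ** 2 for t in years)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(years, yields))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return intercept, slope


def detrend(years, yields):
    """Return the residual variation left after removing the linear trend."""
    intercept, slope = fit_linear_trend(years, yields)
    return [y - (intercept + slope * t) for t, y in zip(years, yields)]


# Invented example series (ton/ha), for illustration only.
years = [1980, 1981, 1982, 1983, 1984, 1985]
yields = [4.0, 4.3, 4.1, 4.6, 4.9, 4.7]
residuals = detrend(years, yields)
```

In this approach the `residuals` series, not the raw yields, would then be regressed on weather, crop or remote sensing indicators.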

Palm and Dagnelie (1993) fitted various time trend functions to national yield series (ton.ha-1) of several crops for 9 EU member states. Regressions were run for the period prior to 1983 and a forecast for 1983 was made. This procedure was repeated for successive years up to 1988. The predictions were compared with national yield values. Of the tested functions, a quadratic function of time performed best, although the differences with a simple linear trend function were small. In a next step, these authors removed the trend from the yield series using the quadratic function. The residuals for the period prior to 1983 were regressed against various meteorological parameters and a prediction for 1983 was made. Again, this procedure was repeated for successive years up to 1988. This was done for 19 Departments in France. Comparing the predicted and official yield series demonstrated that the applied meteorological variables did not improve the prediction accuracy.

Swanson and Nyankori (1979) for corn and soybean production in the USA, Sakamoto (1978) for wheat production in South Australia, and Agrawal and Jain (1982) for rice yields in the Raipur District in India considered the technological time trend to be dependent on the residual variation. According to Winter and Musick (1993), Hough (1990) and Smith (1975), weather affects farm management practices such as planted area, timing of field operations, application of inputs, etc. Hence, the time trend should be analysed simultaneously with the explanatory variables. This approach can be summarised as (Vossen, 1989):

Swanson and Nyankori (1979) showed that the time trend was underestimated when weather data were not analysed simultaneously with the time trend. Similar results were found for millet in Botswana (Vossen, 1989).

The previous equation does not account for the interaction between crop growth and weather variability. Root characteristics and soil physical properties are not accounted for either. Therefore Vossen (1990b, 1992) proposed using crop growth simulation results to describe year-to-year yield variation. In a crop growth simulation model, weather and soil characteristics are summarised, and crop characteristics, including yield, form the output; i.e. the simulation results quantitatively represent the influence of weather variables on crop growth. The yield can be written as:

Prediction model

Official statistics of regional mean yields are predicted by the CGMS using one of the following simulated predictors (see Crop Simulation):

  • Potential above ground biomass (ton.ha-1 dry weight)
  • Water limited above ground biomass (ton.ha-1 dry weight)
  • Potential storage organs biomass (ton.ha-1 dry weight)
  • Water limited storage organs biomass (ton.ha-1 dry weight)

Originally, it was intended to predict yields solely by using the water limited weight of storage organs in the prediction model. Later on, the other three were added. Water limited yield, for instance, is inappropriate for a region with a lot of irrigation. Furthermore, drought stress can be strongly reduced where groundwater has an influence, a factor that is not included in the CGMS. The simulated biomass indicators were added because these are more robust and less sensitive to modelling errors in the distribution of assimilates. Moreover, they also allow yield prediction during the growing season, when grain filling has not yet started or grains are still very small (de Koning et al., 1993).

Although by default only the four crop indicators listed above are taken into account, regression models can be constructed from any combination of indicators available in the Weather Monitoring module, Crop Simulation module and Remote Sensing module. These models can be constructed with SPSS or the user interface of the CGMS statistical tool.

The statistical subsystem of the CGMS uses a combination of a linear time trend and crop growth simulation results as proposed by Vossen (1990b, 1992). This prediction model can be described as:
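The equation itself is not reproduced in this revision. A form consistent with the description above (a linear time trend plus one simulated predictor) and with the coefficient b2 referred to in the selection rules further down would be the following. The notation is assumed for illustration and is not taken from the source: Y_T is the official yield in year T, S_T the simulated predictor for that year, b0, b1 and b2 the regression coefficients, and ε_T the error term.

```latex
Y_T = b_0 + b_1\,T + b_2\,S_T + \varepsilon_T
```

Under this reading, b1 carries the technological time trend and b2 the contribution of the crop growth simulation.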

Per region, for a moving window of at least 9 years, the regression coefficients are established and subsequently used for yield prediction of the 10th year ('one-year-ahead'). The predictor used to forecast the final yield is selected as follows:

  1. Each candidate predictor is fitted to the data currently available for this region.
  2. Candidates with a negative estimate of b2 are rejected because of the nature of the process.
  3. From the remaining ones, the one with the lowest jackknife mean square error is selected.
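The three selection steps can be sketched in Python as below. This is an illustration rather than the CGMS implementation. It assumes a model of the form yield = b0 + b1*year + b2*predictor, reads "jackknife" as leave-one-year-out, and uses invented data. All function and predictor names are hypothetical, and years are given as small indices to keep the normal equations well conditioned.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(years, sims, yields):
    """Least squares estimate of (b0, b1, b2) via the normal equations."""
    rows = [[1.0, t, s] for t, s in zip(years, sims)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, yields)) for i in range(3)]
    return solve(xtx, xty)


def jackknife_mse(years, sims, yields):
    """Leave each year out in turn, refit, and average the squared prediction errors."""
    errors = []
    for k in range(len(years)):
        keep = [i for i in range(len(years)) if i != k]
        b0, b1, b2 = fit([years[i] for i in keep],
                         [sims[i] for i in keep],
                         [yields[i] for i in keep])
        errors.append((yields[k] - (b0 + b1 * years[k] + b2 * sims[k])) ** 2)
    return sum(errors) / len(errors)


def select_predictor(years, yields, candidates):
    """Return the name of the best candidate, or None if every b2 is negative."""
    scores = {}
    for name, sims in candidates.items():
        if fit(years, sims, yields)[2] < 0:  # step 2: reject negative b2
            continue
        scores[name] = jackknife_mse(years, sims, yields)
    return min(scores, key=scores.get) if scores else None


# Invented data: nine years, yields built from the first candidate.
years = list(range(1, 10))
water_limited = [6.0, 5.2, 6.8, 5.9, 6.5, 5.0, 6.9, 6.1, 5.6]
yields = [2.0 + 0.1 * t + 0.5 * s for t, s in zip(years, water_limited)]
candidates = {
    "water_limited_storage": water_limited,
    "inverted": [12.0 - s for s in water_limited],  # artificial: yields a negative b2
    "unrelated": [5.5, 6.3, 5.1, 6.6, 5.8, 6.0, 5.3, 6.4, 5.9],
}
best = select_predictor(years, yields, candidates)
```

The rejection in step 2 happens before any error score is computed, so a candidate with a negative coefficient can never be selected even if it fits well.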

A quadratic trend function is also considered in the CGMS. However, based on the results of Palm and Dagnelie (1993) and de Koning et al. (1993), it was concluded that a linear trend sufficiently describes the increasing official yields. A smooth trend of any type over a large number of years assumes a continuity which might be unrealistic (de Koning et al., 1993; Vossen, 1992; Vossen, 1990a). According to Vossen and Rijks (1995) the predictor should only be based on data from the recent past. The series should nevertheless be long enough to give a sufficient number of degrees of freedom in the regression analysis. A gradual shift in the time trend is allowed for by the shortness of the time series used to derive the predictor.

Required input data are stored in the tables

The statistics cover a wider range of crops than those considered by the Crop Simulation. Therefore the yields of some of the 'statistical crops' are forecast using the same 'CGMS crop'. This relation is stored in table STAT_CROP.

To be able to run the forecast in batch mode, all model parameters are stored in advance in tables:

Every ten days all stored models are run and the results are written to the tables:

Before the start of each growing season, yield forecasts are produced based on the long-term average, corrected for a technological trend. The MARS analyst can change the length of the time series. This redefines the trend function and results in different CGMS level 3 forecasts.

Trend analysis

When the accuracy for a certain combination of country and crop is deemed insufficient, the MARS analyst starts to redefine trend periods and functions using Excel, SPSS or the user interface of the CGMS statistical tool.

First, trends for a longer period (1975 until current year) are determined if yield statistics for such a period are available. Next, trends for more recent periods are studied. For Eastern Europe the period after 1990 is used (to exclude strong changes caused by political changes around 1990). For countries within the European Union the period after 1992 is important because in 1992 the Common Agricultural Policy went through important changes that affected yield and planted areas.

Besides changing the trend period, different trend functions are studied. Yield statistics of each country are taken directly from the CRONOS database, which is updated each month. Linear, quadratic and other types of trend are studied. MARS analysts also study the minimum and maximum trend evolution by separating the data set into two groups representing the 50% highest and 50% lowest values.

Scenario analysis in SPSS

To deal with the residual uncertainty arising from the unknown evolution of the season between the moment the forecast is issued and the moment the crop is harvested, agro-meteorological scenarios can be produced and analysed. The scenario analysis consists of finding the most similar agro-meteorological years based on the time series of parameters simulated by the CGMS. The analysis is based on PCA, Factor Analysis and Cluster Analysis (Hair et al., 1998). By default, the CGMS crop indicators of all available years are used as input (see Crop Simulation). It is stressed that the similarities between years are established on the basis of the time series of agro-meteorological parameters. In fact, years that are similar in climatology are not necessarily similar in crop response, as small changes in the sequence of meteorological events can have a major effect on crop behaviour. This is why the approach is run directly on the crop parameters.

The PCA gives a new combination of independent variables (factors). The first factors, explaining up to 90% of the variability, are selected, and pairs of factor axes are analysed in terms of the original variables. The units (the yearly observations) are then plotted on the new factors to characterise the years (for instance as a dry and hot season).
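Choosing how many factors to retain from the variance they explain can be sketched in Python as below. The eigenvalues are invented for an eight-variable example and the function name is hypothetical. In practice the values would come from the "total variance explained" output of the factor analysis.

```python
def factors_to_keep(eigenvalues, threshold=0.90):
    """Return how many leading factors are needed to reach the given
    cumulative share of the total variance."""
    total = sum(eigenvalues)
    cumulative = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for count, value in enumerate(ordered, start=1):
        cumulative += value / total
        if cumulative >= threshold:
            return count
    return len(ordered)


# Invented eigenvalues for eight standardised variables (they sum to 8).
eigenvalues = [5.2, 2.1, 0.3, 0.2, 0.1, 0.05, 0.03, 0.02]
n_factors = factors_to_keep(eigenvalues)
```

With these made-up values the first factor alone explains 65% of the variance and the first two together about 91%, so two factors are retained.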

This is repeated for each country and at crop level (recall that the original variables are the crop growth parameters as simulated by the CGMS). The analyst then runs a cluster analysis on the new factors (normally a hierarchical cluster), obtaining groups of years that are homogeneous according to the obtained factors. Similarity or dissimilarity matrices help to rank the similarities among years. Once the similarity scores and their ranking have been obtained, the forecast is calculated as the weighted average of the corresponding yields (de-trended where applicable). The weights are given by the similarity indexes. Simple statistics from the cluster of similar years are also used: within the group of similar years, the maximum and minimum yield values are used for the optimistic and pessimistic yield scenarios.

The routine used in SPSS is the following:

FACTOR
 /VARIABLES ds sm wlai plai wb pb twc twr
 /ANALYSIS ds sm wlai plai wb pb twc twr
GRAPH
 /SCATTERPLOT(BIVAR)=fac1_1 WITH fac2_1 BY year (NAME)
CLUSTER fac1_1 fac2_1

In this example ds, sm, wlai, plai, wb, pb, twc and twr stand respectively for development stage, soil moisture, water limited leaf area index, potential leaf area index, water limited biomass, potential biomass, total water consumption and total water requirements; these are the parameters simulated by the CGMS.

For a fixed dekade, crop and country/region/grid, the initial data are the CGMS simulations per year (the years are the units).

Step 1) FACTOR In this example we take the main variables simulated for a given crop and country and, with a FACTOR analysis, reduce them to a few variables explaining about 90% of the variability (2 in the example). As an alternative to fixing a dekade, the procedure can be run on several dekades, in which case the number of variables can increase substantially.

Step 2) GRAPH We then plot the original units (years) on the new axes, which characterises the current season in terms of its impact on crop growth, i.e. wet and cold; wet and hot; dry and cold; dry and hot.

Step 3) CLUSTER The third step is used to identify the similar years, as the factor plot alone may not be sufficient to find them. This is what the cluster algorithm is used for. The similar years are determined from the coefficients of dissimilarity produced in the distance analysis. These coefficients are used in two ways: first, to detect the ten most similar years (or those whose dissimilarity is below a defined threshold); second, as weights to define a prediction.
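The first use of the dissimilarity coefficients, ranking past years against the current one, can be sketched in Python as follows. This is an illustration only: Euclidean distance on two factor scores is assumed as the dissimilarity measure, the scores are invented, and the function name is hypothetical.

```python
import math


def rank_similar_years(scores, target_year, max_years=10, threshold=None):
    """Rank the other years by their distance to target_year in factor space.

    scores maps a year to its tuple of factor scores. Returns a list of
    (year, distance) pairs, most similar first, optionally restricted to
    distances not exceeding threshold.
    """
    target = scores[target_year]
    ranked = []
    for year, point in scores.items():
        if year == target_year:
            continue
        distance = math.dist(point, target)  # Euclidean distance (assumed measure)
        if threshold is None or distance <= threshold:
            ranked.append((year, distance))
    ranked.sort(key=lambda item: item[1])
    return ranked[:max_years]


# Invented factor scores (fac1_1, fac2_1) per year; 2012 is the current season.
scores = {
    2003: (1.8, -1.2),
    2007: (0.4, 0.9),
    2010: (-1.1, 0.3),
    2011: (0.5, 1.1),
    2012: (0.6, 1.0),
}
similar = rank_similar_years(scores, 2012)
close = rank_similar_years(scores, 2012, threshold=0.5)
```

Here `similar` lists all other years in order of increasing distance, while `close` keeps only those within the chosen threshold.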

Step 4) The fourth step is the prediction derived from the similar years (not in the routine above). It uses the pairs (year, yield) belonging to the group of similar years. These determine a range of yields and an average (the minimum and maximum can be used as scenario minimum and scenario maximum, and are explained by the characterisation from the factor analysis). The prediction is then obtained either from the average or (better) from a weighted average in which the weights come from the dissimilarity coefficients. Where a trend is present, all the steps are in fact run on deviations from the trend (so the choice of trend model affects all of the results).
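A minimal Python sketch of this fourth step is given below. The text does not specify how dissimilarities are turned into weights, so the inverse of the dissimilarity is used here as one plausible assumption. The data are invented and the function name is hypothetical.

```python
def scenario_forecast(similar, yields):
    """Combine the yields of the similar years into a forecast and a range.

    similar is a list of (year, dissimilarity) pairs and yields maps a year
    to its (de-trended, where applicable) yield. Returns a tuple
    (weighted average, scenario minimum, scenario maximum).
    """
    # Assumption: weight = 1 / dissimilarity, so closer years count more.
    # The small constant avoids division by zero for an identical year.
    weights = [1.0 / (d + 1e-9) for _, d in similar]
    values = [yields[year] for year, _ in similar]
    weighted = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return weighted, min(values), max(values)


# Invented example: three similar years and their yields (ton/ha).
similar = [(2011, 0.1), (2007, 0.2), (2010, 0.4)]
yields = {2011: 5.0, 2007: 4.4, 2010: 5.6}
forecast, scenario_min, scenario_max = scenario_forecast(similar, yields)
```

With inverse weighting the most similar year dominates the forecast, while the minimum and maximum are unaffected by the weights and serve as the pessimistic and optimistic scenarios.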