Evaluating the quality of predictions is an essential component of seasonal forecasting. Verification information is now available online for IRI’s seasonal probabilistic forecasts from 1997 to present. This is part of IRI’s larger goal to define, implement, and promote a set of verification procedures that provide accurate and easily understandable information for communicating the quality and potential value of seasonal forecast products.
Go to the verification page.
What makes a good forecast? Surely a good forecast is one in which the outcome is consistent with what was predicted, and is “correct” or “accurate.”
Although the interpretation above may be reasonable for some forecasts, for seasonal forecasts that are issued as probabilities there is no single predicted value that can be compared to the observed value. For example, if the observed temperature is 31°C (about 88°F) and the forecast was 30°C (86°F), it is clear that the forecast was 1°C (about 2°F) too cold, and there are various ways of deciding whether that error is large or small. However, if the forecast indicates a 70% probability of temperatures warmer than 30°C (“above-normal”) and a 10% probability of temperatures colder than 25°C (“below-normal”), the error cannot be assessed in the same way because no single temperature was forecast. Even in cases in which the observed temperature did exceed 30°C, we cannot simply conclude that the forecast was “correct” or “accurate”. And how would we distinguish that case from one in which the forecast probability had been 80%? Or 20%?

To measure the quality of probabilistic forecasts we have to ask whether the probabilities themselves are “good.” Unfortunately, it is not obvious how to compare probabilities with a single observation, and no one question gives an adequate answer to whether the probabilities are good. Because of this, we assess the quality of forecasts based on a number of properties, or “attributes”.
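To make the contrast concrete, here is a small sketch (the numbers are hypothetical, and this particular score is not part of the article) using the ranked probability score, one common way to summarize how well a set of category probabilities matches a single observation:

```python
# Hypothetical example: scoring a three-category probabilistic forecast
# against one observation with the ranked probability score (RPS).

def ranked_probability_score(probs, observed_index):
    """RPS for a single forecast: sum of squared differences between the
    cumulative forecast probabilities and the cumulative observation
    (0 before the observed category, 1 from it onward).
    probs: probabilities for (below-, near-, above-normal).
    observed_index: index of the category that occurred."""
    cum_forecast = 0.0
    cum_observed = 0.0
    score = 0.0
    for i, p in enumerate(probs):
        cum_forecast += p
        cum_observed += 1.0 if i == observed_index else 0.0
        score += (cum_forecast - cum_observed) ** 2
    return score

# Forecast: 10% below-, 20% near-, 70% above-normal; above-normal occurred.
sharp = ranked_probability_score([0.1, 0.2, 0.7], 2)
# A near-climatological forecast is penalized more for the same outcome.
flat = ranked_probability_score([0.33, 0.33, 0.34], 2)
print(sharp, flat)
```

Note that even here the score rewards a single forecast only relative to others: it cannot say that one forecast was “correct,” which is why quality is assessed over many forecasts via the attributes below.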
Attributes of good probabilistic forecasts
- Good probabilistic forecasts assign a high probability to the observed outcome (“sharpness”).
If the observed temperature (31°C) was above-normal (warmer than 30°C) and the forecast probability was 80%, it might seem a better forecast than if the probability had been 20%. Ideally the category with the highest probability should occur most frequently, but what then is the difference between a forecast with 80% on the observed category and one with 70%? If a high probability on the observed outcome were the only criterion, why not just issue 100% probability on the most likely outcome?
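The answer to that rhetorical question can be sketched numerically (the 70% “true” chance below is a hypothetical assumption, not a figure from the article): under a proper score such as the Brier score, always claiming certainty loses in the long run to issuing the honest probability.

```python
# Illustrative sketch: if above-normal actually occurs 70% of the time,
# issuing 100% every time maximizes sharpness but is penalized on
# average by the Brier score (mean squared error of the probability).

def expected_brier(issued_prob, true_prob):
    # Expected squared error of the issued probability when the event
    # occurs with long-run frequency true_prob.
    return (true_prob * (issued_prob - 1.0) ** 2
            + (1.0 - true_prob) * issued_prob ** 2)

honest = expected_brier(0.7, 0.7)         # issue the true probability
overconfident = expected_brier(1.0, 0.7)  # always claim certainty
print(honest, overconfident)              # honest scores better (lower)
```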
- The probabilities themselves need to be “reliable” indications of how frequently that category will occur.
It is impossible to determine what the actual probability should have been for any one forecast, but if the categories with forecast probabilities of 70% occur substantially more or less frequently than 70% of the time, those probabilities are unreliable. Reliable probabilities alone do not necessarily make good forecasts: if the forecasts only indicate how often each category occurs on average, they will be reliable but will provide no information as to which of the categories is more or less likely to occur on this specific occasion. Just as sharpness can be achieved with the loss of reliability, so reliability can be achieved with the loss of sharpness.
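A minimal reliability check, using hypothetical forecasts and outcomes, groups the forecasts by the issued probability and compares it with how often the category actually occurred:

```python
# Sketch of a reliability table: for each issued probability, the
# observed relative frequency of the category. Data are hypothetical.
from collections import defaultdict

def reliability_table(probs, outcomes):
    """probs: issued probabilities for one category (e.g. above-normal).
    outcomes: 1 if that category occurred on that occasion, else 0.
    Returns {issued probability: observed relative frequency}."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for p, o in zip(probs, outcomes):
        counts[p] += 1
        hits[p] += o
    return {p: hits[p] / counts[p] for p in counts}

# Forecasts of 70% that verify only half the time are overconfident.
table = reliability_table([0.7, 0.7, 0.7, 0.7, 0.4, 0.4],
                          [1,   1,   0,   0,   0,   1])
print(table)  # {0.7: 0.5, 0.4: 0.5}
```

In practice many forecasts per probability bin are needed before such frequencies are meaningful; the six cases above are only to show the bookkeeping.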
- A category should occur more frequently as its probability increases, and less frequently as the probability decreases (“resolution”).
The probability for a category should indicate how confident we can be that that category will occur (reliability), but even if the probabilities are unreliable they may still be meaningful. If the forecast probability for above-normal temperature increases from 25% to 50%, for example, its actual probability may not double because of imperfect reliability, but it should at least increase by some amount. A category should occur more frequently when its probability is high than when its probability is low. Resolution measures whether the outcome differs when the forecast differs: if, on average, the outcome is the same regardless of the forecast, the forecasts lack resolution and are essentially useless.
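The resolution idea can be sketched with hypothetical data: condition on the issued probability and ask whether the observed frequency of the category changes.

```python
# Resolution sketch (hypothetical data): does the category occur more
# often when its issued probability is higher?

def conditional_frequencies(probs, outcomes):
    """Observed frequency of the category for each distinct issued
    probability, plus the overall (unconditional) frequency."""
    overall = sum(outcomes) / len(outcomes)
    freqs = {}
    for p in sorted(set(probs)):
        subset = [o for q, o in zip(probs, outcomes) if q == p]
        freqs[p] = sum(subset) / len(subset)
    return freqs, overall

freqs, overall = conditional_frequencies(
    [0.2, 0.2, 0.5, 0.5, 0.8, 0.8],
    [0,   0,   0,   1,   1,   1])
# The category occurs more often at higher probabilities, so these
# forecasts have resolution even though the observed frequencies need
# not equal the issued probabilities themselves.
print(freqs, overall)
```

If every entry of `freqs` equaled `overall`, the outcome would be the same regardless of the forecast: no resolution.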
- The probability for a category should increase as that category occurs more frequently and its probability should decrease as it occurs less frequently (“discrimination”).
Whereas resolution measures whether the outcome differs when the forecast differs, discrimination measures whether the forecast differs given different outcomes. When temperatures were above-normal, the forecasts should have had higher probabilities for above-normal temperature than when temperatures were unusually cold: if, on average, the forecast is the same regardless of the outcome, the forecasts are essentially useless.
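Discrimination can be sketched the same way, but conditioning on the outcome instead of the forecast (again with hypothetical numbers):

```python
# Discrimination sketch (hypothetical data): did the forecasts differ
# between occasions when the category occurred and when it did not?

def mean_prob_by_outcome(probs, outcomes):
    """Average issued probability for the category, computed separately
    for occasions when the category occurred (outcome 1) and when it
    did not (outcome 0)."""
    occurred = [p for p, o in zip(probs, outcomes) if o == 1]
    did_not = [p for p, o in zip(probs, outcomes) if o == 0]
    return sum(occurred) / len(occurred), sum(did_not) / len(did_not)

when_yes, when_no = mean_prob_by_outcome(
    [0.7, 0.6, 0.5, 0.3, 0.2],
    [1,   1,   0,   0,   0])
# A higher average probability on the occasions when the category
# occurred indicates the forecasts discriminate between the outcomes.
print(when_yes, when_no)
```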
The attributes of good probabilistic forecasts described above indicate how well the forecasts compare with the observations. However, even good probabilistic forecasts are not necessarily useful, especially if the forecasts are not good in all the attributes detailed above. Because the actual benefit of forecasts depends partly upon how they are used, estimates of their value will be specific to individual decision-making contexts. Still, if some assumptions are made about how decisions are made and what the costs and benefits of different outcomes are, hypothetical estimates of forecast value can be made.
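One standard set of such assumptions is the simple cost/loss model: a user pays a cost C to protect against an adverse event, or suffers a loss L if the event occurs unprotected. The decision rule, costs, and forecast data below are illustrative assumptions, not figures from the article.

```python
# Hypothetical cost/loss sketch: the user protects whenever the forecast
# probability of the adverse event exceeds a chosen threshold.

def expense(probs, outcomes, cost, loss, threshold):
    """Total expense over a series of occasions.
    probs: forecast probabilities of the event; outcomes: 1 if it occurred."""
    total = 0.0
    for p, o in zip(probs, outcomes):
        if p > threshold:
            total += cost   # protected: pay the protection cost
        elif o == 1:
            total += loss   # unprotected and the event occurred
    return total

probs = [0.8, 0.7, 0.3, 0.2, 0.6]
outcomes = [1, 1, 0, 0, 1]
with_forecasts = expense(probs, outcomes, cost=1.0, loss=5.0, threshold=0.5)
always_protect = expense([1.0] * 5, outcomes, cost=1.0, loss=5.0, threshold=0.5)
never_protect = expense([0.0] * 5, outcomes, cost=1.0, loss=5.0, threshold=0.5)
print(with_forecasts, always_protect, never_protect)
```

The forecasts have value for this hypothetical user if the expense when following them is lower than the expense of the best forecast-free strategy (always or never protecting); with different costs, losses, or thresholds the same forecasts could have no value at all.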
Forecast Quality Scores
| Measures of Forecast Quality | Questions Answered |
| --- | --- |
| Discrimination | If the outcomes are different, are the forecasts different? For example, does the probability for a category increase if the category occurs more frequently? |
| Resolution | If the forecasts are different, are the outcomes different? For example, does the category occur more frequently as its probability increases? |
| Reliability | Do the forecasts have an appropriate level of confidence? |
| Unconditional Bias | Have the forecasts successfully indicated any short-term shifts in climate? |
| Number of hits | How often does the category with the highest (or second- or third-highest) probability occur? |
| Value | Could use of this forecast result in any benefit (or loss)? |
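The “number of hits” row is the simplest score in the table to compute. A sketch with hypothetical three-category forecasts:

```python
# Counting "hits" (hypothetical data): how often the category given the
# highest probability was the one that occurred.

def count_hits(forecasts, observed):
    """forecasts: list of (below, near, above) probability triples.
    observed: index of the category that occurred for each forecast."""
    hits = 0
    for probs, obs in zip(forecasts, observed):
        highest = max(range(len(probs)), key=lambda i: probs[i])
        if highest == obs:
            hits += 1
    return hits

forecasts = [(0.2, 0.3, 0.5), (0.5, 0.3, 0.2), (0.1, 0.6, 0.3)]
observed = [2, 1, 1]
print(count_hits(forecasts, observed))  # 2: the favored category occurred twice
```

As the article notes, a high hit count alone says nothing about whether the probabilities were reliable, so this score complements rather than replaces the others.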