PLANES Interpretation
interpretation.Rmd
Overview
The rplanes package includes vignettes detailing basic usage and describing
the individual components that the package
uses for plausibility analysis. This vignette focuses on interpreting
rplanes plausibility outputs. The content includes a primer
on the weighting scheme for plausibility components, a discussion of
limitations that may arise from relying on seed data, and considerations
for what to do when a flag is raised.
Weighting scheme within plane_score()
The plane_score() function allows users to evaluate
multiple plausibility components simultaneously (i.e., without having to
call individual scoring functions for each component). This
wrapper returns an overall score that summarizes all components evaluated.
The plane_score() function includes an optional “weights”
argument that allows user-specified weighting of components in the
overall score. By default (weights = NULL), each component
is given equal weighting. To weight certain components higher
or lower, the user must specify a named vector in which each value
is the weight given to the correspondingly named component. The length of the
vector must equal the number of components used in the “components”
argument of plane_score(). For more technical details about
plane_score() or how to apply the weighting scheme, users
should refer to the function documentation (?plane_score)
or the basic usage vignette.
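As a rough illustration of how a named weight vector could feed into an overall score, the sketch below computes a weighted share of flagged components in base R. It does not call rplanes. The helper weighted_score() and the flag values are hypothetical, and the formula (sum of the weights on flagged components divided by the sum of all weights) is an assumption made for illustration rather than a statement of the package internals.

```r
# Hypothetical helper (not part of rplanes): weighted share of flagged components.
# Assumption: score = sum(weights of flagged components) / sum(all weights).
weighted_score <- function(flags, weights = NULL) {
  if (is.null(weights)) {
    # Equal weighting when no weights are supplied
    weights <- stats::setNames(rep(1, length(flags)), names(flags))
  }
  # The weight vector must be named and cover exactly the components evaluated
  stopifnot(
    length(weights) == length(flags),
    !is.null(names(weights)),
    setequal(names(weights), names(flags))
  )
  # Align the weights with the order of the flags before summing
  weights <- weights[names(flags)]
  sum(weights[flags]) / sum(weights)
}

# Hypothetical flag results for four components (TRUE = flag raised)
flags <- c(diff = TRUE, taper = FALSE, trend = FALSE, zero = FALSE)

equal_score  <- weighted_score(flags)
custom_score <- weighted_score(
  flags,
  weights = c(diff = 4, taper = 1, trend = 1, zero = 1)
)
```

Under these assumptions, the single diff flag yields a score of 0.25 with equal weights and 4/7 (about 0.57) when diff is weighted four times as heavily as the other components.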
Motivations for weighting
The weighting scheme is incorporated because scores may be
context-dependent. In other words, users may have varying concerns about
specific components being evaluated given the timing, historical data
patterns, and specific goals of their plausibility assessment. Below we
have included several examples highlighting use cases for applying the
weighting scheme to plane_score():
- If users are retrospectively analyzing forecast signals after the ground-truth, observed data for that horizon has been reported, they may be aware that a large jump in reported cases actually occurred. In this scenario, they might be less interested in the difference component, which evaluates the forecast and raises a flag if there is a point-to-point difference greater than any difference found in the observed seed. However, the users might still be concerned about any unreasonable jumps in forecasted cases. Rather than eliminating the difference component altogether, the users can reduce its weight within plane_score().
- During an expected large uptick in cases, users might increase the weight of the trend component to more heavily penalize unexpected dips in cases. This ensures that significant trends are given the appropriate emphasis in the analysis.
- At the beginning of a season, when zeros may be very common in some locations, users might decide to reduce the relative weight of the zero and repeat components. This adjustment may help account for the seasonality and expected patterns in the data.
- When working with relatively short time series seed data (e.g., only several months), users may encounter many shape flags due to the limited number of shapes found in the seed data. In such cases, users can reduce the relative weight of the shape component.
- If evaluating forecasts operationally, users may be more interested in ensuring that prediction intervals are appropriately calibrated (e.g., not too narrow). In this case, the cover and taper components might merit higher weighting.
Limitations that arise from seed data
As described in the basic usage vignette, the rplanes
plausibility analysis procedure
requires establishing a “seed” object based on an observed signal. The
seed data serves as the basis for the background characteristics used to
assess plausibility. As such, plausibility results will depend upon the
reliability and length of the time series used to establish the seed
data. Here we discuss how both of these factors could potentially impact
plausibility analysis with rplanes.
Reliability of seed data
The plausibility analysis in rplanes assumes that users
have access to observed data to establish baseline characteristics of
the time series. Presumably, this observed signal is trustworthy and is
a faithful representation of what one should expect from the signal to
be evaluated. However, in practice, data issues such as lagged
reporting, backfill, and other systematic biases in ascertainment may
lead to unexpected behaviors in rplanes (i.e., too many or
too few flags raised). If there are known issues within the seed data,
users should carefully consider all individual components used in the
plausibility scoring and determine how they might impact the results of
the plausibility analysis. For example, consider observed data that may
lack consistent reporting in certain locations, particularly early on in
the time series. In this case, the seed data might contain many
consecutive zeros across a long time span followed by more reliable
reporting at these locations, and therefore rplanes would
rarely (if ever) raise flags for the repeat and zero
components. For this scenario, truncating the observed data to begin
when reporting becomes more reliable before creating the seed may be
appropriate.
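The truncation step described above can be sketched in base R. The data frame, its column names, and the rule used to pick the cutoff (the first non-zero observation) are all hypothetical and chosen only for illustration; in practice a known reporting start date may be a better cutoff.

```r
# Hypothetical weekly observations for one location: the early weeks were
# not reliably reported and appear as zeros.
observed <- data.frame(
  location = "01",
  date = seq(as.Date("2022-01-01"), by = "week", length.out = 12),
  value = c(0, 0, 0, 0, 0, 14, 18, 11, 20, 25, 19, 22)
)

# Assumption for illustration: reporting is treated as reliable from the
# first non-zero observation onward.
first_reliable <- min(observed$date[observed$value > 0])

# Keep only the observations on or after the cutoff
observed_trimmed <- observed[observed$date >= first_reliable, ]
```

The trimmed data would then go through the usual seed-creation steps described in the basic usage vignette, so that the long run of early zeros no longer informs the repeat and zero components.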
Length of seed data
Besides the reliability, the length (i.e., number of observations) of
the observed signal used to create the seed can influence the
rplanes plausibility scoring. In general, more data
provides higher resolution for the characteristics that could manifest
in the evaluated signal. However, users should be aware of computational
costs and a potential reduction in sensitivity in some components as the
amount of the available seed data increases. Here we provide several
considerations and examples of balancing the length of observed data
used to create a seed.
In scenarios when the seed has been created with a relatively small
number of observations, users may notice that some components have
higher sensitivity resulting in more flags raised. Some of the
individual components have a required seed-to-signal length ratio that
must be met for the function to run (e.g., shape requires that
the seed is at least four times the length of the forecast being
evaluated). However, even with a built-in minimum length, there are
cases where the seed data may be too short to produce reasonable
results. For example, consider evaluating a forecast of four weeks
ahead. In this case, the seed must contain at least 16 weekly
observations for the given location. However, 16 weeks is roughly four
months of data, which (depending on seasonality and timing of the
observations) may not adequately capture all of the shapes that could
plausibly manifest four weeks into the future. Below is a table
detailing some potential complications caused by a seed object that is
too short. For all of these possible issues, we recommend that users
manually examine flagged locations where feasible and consider
reducing the relative weights of the affected components within the
plane_score() function.
Component | Description | Issue |
---|---|---|
Difference | Point-to-point difference | Without enough point-to-point differences evaluated in the seed data, a true and reasonable jump in the signal may be flagged as implausible. |
Repeat | Values repeat more than expected | If there are no repeats in the seed, a single repeat will not be tolerated and will be flagged. Decreasing the prepend length and/or increasing the repeat tolerance should help mitigate this. |
Shape | Shape of signal trajectory has not been observed in seed data | A short seed object contains fewer signal trajectory shapes that can be compared to the signal, so a reasonable trajectory might be erroneously flagged. |
Zero | Zeros found in signal when not in seed | If there are no zeros in the seed, a single zero in the signal will not be tolerated and will be flagged. |
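To make the short-seed issue for the difference component concrete, the base R sketch below compares the largest point-to-point difference seen in a short seed against a longer one. It is a simplified stand-in for the logic described in this vignette, not the rplanes implementation, and all values are hypothetical.

```r
# Minimum seed length implied by the four-to-one ratio described for shape
horizon <- 4
min_seed_length <- 4 * horizon

# Hypothetical seed values: a longer history, and a short seed that keeps
# only the most recent observations
seed_long <- c(12, 15, 30, 22, 18, 14, 11, 10, 12, 13,
               15, 14, 16, 18, 17, 19, 21, 20, 22, 24)
seed_short <- tail(seed_long, min_seed_length)

# Largest absolute point-to-point difference observed in each seed
max_diff_short <- max(abs(diff(seed_short)))
max_diff_long  <- max(abs(diff(seed_long)))

# A hypothetical forecast jump of 10 is compared against each seed
forecast_jump <- 10
flag_short <- forecast_jump > max_diff_short
flag_long  <- forecast_jump > max_diff_long
```

With these made-up values the short seed never shows a change larger than 4, so a jump of 10 would be flagged, while the longer seed contains a change of 15 and the same jump would pass.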
While too few observations in the seed may lead to the limitations
described above, too many seed values can also trigger unexpected
behavior. As the amount of seed data increases, the plausibility
components will have lower sensitivity, which may result in fewer flags.
The rplanes package includes options to mitigate this. When
using the repeat component, increasing the “prepend” length and
decreasing the repeat tolerance should increase the sensitivity. In some
circumstances, the decreased sensitivity cannot be mitigated by
adjusting component parameters. When evaluating the shape
component, a longer seed time-series will likely contain more unique
shapes, resulting in fewer potentially novel shapes in forecasts and
therefore fewer flags being raised. Having observed a similar
epidemiological signal before is not (alone) enough justification to
infer that this shape is not unusual and should not be flagged. For
example, it may be appropriate to flag forecasts of COVID-19 activity in
2024 that exhibit trajectories similar to the most extreme surges in
2020-2021. However, in this situation if a user’s reference seed data
included pandemic activity levels, the flag would not be raised.
Unlike situations when the seed data is too short, changing the
relative weights of components that are not flagged will not have as much
of an impact on overall scores. Down-weighting components when using a
short seed object causes potentially erroneous flags to have less of an
effect on the overall score; however, increasing the weights will not
cause more flags to be raised (i.e., changing the weights does not
change the sensitivity of components). Further, manual inspection of
“missed flags” is much more challenging and time-consuming. If users
suspect that the sensitivity of components in plane_score() is
negatively impacted by having too many seed values, we suggest that they
consider truncating the data prior to creating the seed.
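The asymmetry between down-weighting and up-weighting can be checked with a little arithmetic. The sketch below assumes, for illustration only, that the overall score is the weighted share of flagged components; weighted_share() is a hypothetical helper and the flags are made up.

```r
# Hypothetical helper (not rplanes internals): weighted share of flagged components
weighted_share <- function(flags, weights) sum(weights[flags]) / sum(weights)

components <- c("diff", "repeat", "shape", "zero")

# Long seed: no component raises a flag
no_flags <- stats::setNames(rep(FALSE, 4), components)
# Short seed: the shape component raises a possibly erroneous flag
shape_flag <- stats::setNames(c(FALSE, FALSE, TRUE, FALSE), components)

equal_w    <- stats::setNames(rep(1, 4), components)
shape_up   <- replace(equal_w, "shape", 10)     # shape weighted much higher
shape_down <- replace(equal_w * 2, "shape", 1)  # shape weighted half as much as the others

# Up-weighting cannot create a flag that was never raised
score_missed_equal <- weighted_share(no_flags, equal_w)
score_missed_up    <- weighted_share(no_flags, shape_up)

# Down-weighting softens the effect of a flag that was raised
score_flag_equal <- weighted_share(shape_flag, equal_w)
score_flag_down  <- weighted_share(shape_flag, shape_down)
```

Under this assumed formula the unflagged case scores 0 regardless of the weights, while the flagged case drops from 0.25 to 1/7 once shape is down-weighted relative to the other components.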
What to do when a flag is raised
The rplanes package provides a mechanism to review
plausibility of epidemiological signals. When considering how to
interpret results, it is paramount to distinguish between signals that
may be plausible and those that are possible. A signal perceived
as implausible may come to reflect true patterns in reporting once the
horizons evaluated have passed. We recommend that users consider
plausibility analysis primarily as a guide rather than a replacement for
subsequent manual review.
To the extent that is feasible, users may consider manually
inspecting flagged signals. If inspecting flags, we suggest that users
plot the observed seed data along with the signal being evaluated. If
many flags are raised and the signal appears implausible to subject
matter experts, users can likely accept the plausibility score and
either censor or adjust the forecast or observed signal being evaluated.
The score could also be used as a downstream weight for this forecast
(e.g., in an ensemble model). If flags are raised but the signal does
not appear implausible, inspect the individual flagged components.
Adjusting the arguments for repeat and trend can
increase or decrease their sensitivity. Short seed data can cause
certain components (diff, repeat, shape, and zero) to be overly
sensitive, and users can either weight these components less within
plane_score() or remove them altogether.
Before drawing any conclusions from rplanes results, we
recommend that users first analyze retrospective data with the package
to understand the distribution of flags that they should expect in their
signal. For example, if reviewing operational forecasts of flu
hospitalizations, users may consider retrieving historical forecasts for
the same signal, retrospectively masking the observed data for each
available forecast week, and summarizing the plausibility scores. Such
an analysis will provide critical insight as to the baseline sensitivity
of the signal to scoring. Users may consider setting thresholds for
action on future evaluated signals based on the distribution of flags
raised in this analysis.
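One way to turn such a retrospective analysis into a threshold is sketched below in base R. The scores are made-up placeholders standing in for the output of scoring historical forecast weeks, and the 90th percentile cutoff is an arbitrary choice for illustration, not a package default or a recommendation.

```r
# Hypothetical plausibility scores, one per retrospectively scored forecast
# week for a single signal (values are made up for illustration)
retro_scores <- c(0, 0, 0.14, 0, 0.29, 0.14, 0, 0.43, 0.14, 0, 0.14, 0.57)

# Summarize the baseline distribution of scores
score_summary <- summary(retro_scores)
share_with_flags <- mean(retro_scores > 0)

# Assumed rule: act on future scores above the 90th percentile of the
# retrospective distribution
action_threshold <- unname(quantile(retro_scores, probs = 0.9))

# Compare a newly scored (hypothetical) signal against the threshold
new_score <- 0.57
needs_review <- new_score > action_threshold
```

Choosing the cutoff from the signal's own history keeps the threshold tied to how often flags are normally raised for that signal, rather than to a fixed value shared across signals.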
Summary
There are many reasons that a user might want to change the relative
weights of individual components or leave components out of the
plane_score() function altogether. Manual inspection of
raised flags may be informative, particularly if users suspect that
flags are raised erroneously (for any of the reasons discussed in this
vignette). We also recommend collecting (or simulating) a few signals
that you would expect to trigger flags for calibration purposes. Lastly,
retrospective analysis of plausibility scores as a batch
(across multiple evaluation time points) can be highly informative for
guiding interpretation.