library('tidyverse') # Wrapper for many convenient libraries
library('tsibble') # Working with time series tibbles
library('tsibbledata') # Contains data for the assignment
library('skimr') # Quickly summarise data
library('gt') # Render "great tables"
library('forecast') # ARIMA modeling and forecasting
library('tseries') # Time series tests (e.g. the ADF test)
library('astsa') # Applied statistical time series analysis
ARIMA, Time Series Forecasting | Medicare Prescription Volumes
Assignment Questions
Name
What is your name? Include all team members if submitting as a group.
Adam Bushman [u6049169]; no group members.
Perspective
From what perspective are you conducting the analysis? (Who are you? / Who are you working for?)
I am a Medicare prescription plan manager, in charge of monitoring script dispenses across Medicare plans. It’s important for me to understand the anticipated scale (in volume and cost) of scripts for the next calendar year.
Question
What is your question?
What is the range of likely scenarios for monthly Medicare prescription dispenses over the next 12 months?
Data set
Describe your dataset(s) including URL (if available)
The data set is titled Monthly Medicare Australia prescription data, sourced from the {tsibbledata} package. It was originally sourced from Medicare Australia.
Predictor(s) and target(s)
What is (are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric).
The independent variable is Month; other covariates are excluded for simplicity. It is of the yearmonth data type, representing a unique month and year.
The dependent variables are Scripts and Cost, each forecasted independently. They are of type double.
Additional detail may be seen in the data dictionary.
Model reasonability
How are your variables suitable for your analysis method?
The data are already formatted for use in time series analysis and forecasting. A tsibble is a time-series-aware tibble that adheres to “tidy” principles. Additionally, the variables are well suited for ARIMA since we have consistent time intervals and continuous target variables. Furthermore, the data exhibit features ARIMA is designed to handle, such as non-stationarity and auto-correlation.
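As a quick illustration of that tidy structure, a base-R ts object can be converted with as_tsibble(). This sketch uses the built-in AirPassengers monthly series as a stand-in, not the assignment data:

```r
library(tsibble)

# Convert a base-R "ts" object into a tidy, one-row-per-observation tsibble;
# AirPassengers is a stand-in monthly series, not the assignment data
ap <- as_tsibble(AirPassengers)

head(ap) # columns: index (a yearmonth) and value
```

Because each observation is its own row, the usual dplyr verbs (filter, mutate, summarise) apply directly.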
Conclusions
What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?
While exploring monthly Script and Cost data, it was found that the delivered format was demonstrative of seasonality and non-stationarity, elements well-suited for ARIMA to correct. Having met the assumptions of the forecasting model, we proceeded to fit the model.
The auto.arima modeling approach yielded an ARIMA(2,1,2) model with drift:
- \(p = 2\): two lagged values were leveraged
- \(d = 1\): the values were “differenced” once to achieve stationarity
- \(q = 2\): the errors of the two previous values were included
The resulting model yielded forecasted values for the next twelve months, with an 80% prediction interval. The volume of scripts in the first forecasted month is anticipated to fall between 12.4M and 16.2M with 80% certainty.
Assumptions
What are your assumptions and limitations? Did you include any robustness checks?
A big limitation in the data is the non-normally distributed values of Cost and Scripts. Logarithmic transforms were extremely helpful in resolving the issue but complicated interpretation of the model and forecasted values throughout. I’d be interested to learn more about how this is handled in industry.
Additionally, the data were presented with variable levels and grouped covariates for Concession, Type, and ATC1/2. It’s likely a Medicare prescription program manager would be more interested in the varying forecasts that make up these slices. Due to a knowledge gap in this area, this was resolved via aggregation.
It was also assumed that auto.arima would choose an optimal model. However, there was some suggestive evidence different values could have produced a similarly performant result but with more robustness. Ultimately, I ceded model choice to the algorithm. I could have run both and compared AICs.
Assignment Workflow
Analysis Prep
Loading packages
Loading the data
We’ll begin by referencing the data from {tsibbledata}; the “PBS” data are the collection of variables respective to Australian Medicare prescriptions.
px_data <- tsibbledata::PBS # Load data into the session
Let’s now get a sense for what we’re working with.
Data set inspection
Right away, I like to get acquainted with the data set. That means understanding what each column seeks to describe, confirming the granularity of the rows, and getting my arms around structure, completeness, distribution, etc.
The package designers for {tsibble} included the following data dictionary:
gt(px_data_dict) %>% # Create a "great tables" (gt) object
cols_label( # Rename some columns
variable = "Variable Name",
datatype = "Data Type",
description = "Variable Description"
) %>%
tab_options(
column_labels.background.color = "#d9230f", # Apply red header background
column_labels.font.weight = "bold", # Bold headers
row.striping.background_color = '#FFFFFF' # Remove row striping
)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| Month | <yearmonth> | A key for the year/month of the observation |
| Concession | <char> | Concessional scripts are given to pensioners, unemployed, dependents, and other card holders |
| Type | <char> | Co-payments are made until an individual's script expenditure hits a threshold ($290.00 for concession, $1,141.80 otherwise). Safety net subsidies are provided to individuals exceeding this amount |
| ATC1 | <char> | Code for Anatomical Therapeutic Chemical index (level 1) |
| ATC1_desc | <char> | Description for Anatomical Therapeutic Chemical index (level 1); divides drugs into 14 anatomical groups based on the primary part of the body or organ system they act upon |
| ATC2 | <char> | Code for Anatomical Therapeutic Chemical index (level 2) |
| ATC2_desc | <char> | Code for Anatomical Therapeutic Chemical index (level 2); specifies the type of treatment or function the drugs perform within the organ system |
| Scripts | <double> | Total number of scripts |
| Cost | <double> | Cost of the scripts in $AUD |
Using the {skimr} package, we can get a comprehensive summary of the data.
skim(px_data[,-1]) # Left out <yearmonth> data type
| Name | px_data[, -1] |
| Number of rows | 67596 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Concession | 0 | 1.00 | 7 | 12 | 0 | 2 | 0 |
| Type | 0 | 1.00 | 10 | 11 | 0 | 2 | 0 |
| ATC1 | 0 | 1.00 | 1 | 1 | 0 | 15 | 0 |
| ATC1_desc | 3193 | 0.95 | 7 | 63 | 0 | 14 | 0 |
| ATC2 | 0 | 1.00 | 1 | 3 | 0 | 84 | 0 |
| ATC2_desc | 2377 | 0.96 | 1 | 63 | 0 | 84 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Scripts | 0 | 1 | 35096.17 | 106803.3 | 0 | 68.00 | 1225.50 | 13506.75 | 1607119 | ▇▁▁▁▁ |
| Cost | 0 | 1 | 868677.62 | 2987324.8 | 0 | 1743.93 | 33396.57 | 348131.28 | 72911261 | ▇▁▁▁▁ |
The data are mostly complete, thankfully. We see about 5% of records with missing ATC1_desc and ATC2_desc. These values aren’t strictly necessary for a simple forecasting exercise anyway.
I always like to preview the data itself; here we take the first few rows with head():
head(px_data) |>
mutate(Month = yearmonth(Month))
# A tsibble: 6 x 9 [1M]
# Key: Concession, Type, ATC1, ATC2 [1]
Month Concession Type ATC1 ATC1_desc ATC2 ATC2_desc Scripts Cost
<mth> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1991 Jul Concessional Co-paymen… A Alimenta… A01 STOMATOL… 18228 67877
2 1991 Aug Concessional Co-paymen… A Alimenta… A01 STOMATOL… 15327 57011
3 1991 Sep Concessional Co-paymen… A Alimenta… A01 STOMATOL… 14775 55020
4 1991 Oct Concessional Co-paymen… A Alimenta… A01 STOMATOL… 15380 57222
5 1991 Nov Concessional Co-paymen… A Alimenta… A01 STOMATOL… 14371 52120
6 1991 Dec Concessional Co-paymen… A Alimenta… A01 STOMATOL… 15028 54299
Simple Exploratory Data Analysis
Tracking levels
It’s clear that the data are disaggregated. For example, the number of Scripts and total Cost have been split across the supplemental variables. Let’s get a sense for what that looks like.
Concession
px_data %>%
count(.by = Concession) %>%
mutate(share = n / sum(n))
# A tibble: 2 × 3
.by n share
<chr> <int> <dbl>
1 Concessional 34020 0.503
2 General 33576 0.497
Type
px_data %>%
count(.by = Type) %>%
mutate(share = n / sum(n))
# A tibble: 2 × 3
.by n share
<chr> <int> <dbl>
1 Co-payments 33612 0.497
2 Safety net 33984 0.503
ATC1
px_data %>%
count(.by = ATC1) %>%
mutate(share = n / sum(n))
# A tibble: 15 × 3
.by n share
<chr> <int> <dbl>
1 A 10140 0.150
2 B 3264 0.0483
3 C 7344 0.109
4 D 8100 0.120
5 G 3264 0.0483
6 H 4080 0.0604
7 J 4848 0.0717
8 L 3216 0.0476
9 M 4080 0.0604
10 N 4896 0.0724
11 P 2412 0.0357
12 R 3936 0.0582
13 S 3120 0.0462
14 V 4080 0.0604
15 Z 816 0.0121
Conclusion
This forecasting exercise is supposed to be simple. We aren’t meant to generate multiple, grouped forecasts. We’ll assume a slice or fully aggregated data set is of interest to the user.
Assessing distribution
The variables available to us for forecasting are Scripts and Cost. Let’s get a sense for their distribution.
First, to make plotting easier, let’s pivot the data to a long format:
px_pivot <- px_data |>
as_tibble() |>
select(Scripts, Cost) |> # Restrict data to variables of interest
pivot_longer( # Pivot to a long version
cols = everything(),
names_to = "measure",
values_to = "value"
)
head(px_pivot)
# A tibble: 6 × 2
measure value
<chr> <dbl>
1 Scripts 18228
2 Cost 67877
3 Scripts 15327
4 Cost 57011
5 Scripts 14775
6 Cost 55020
A great way to visualize distribution is with a “raincloud” plot, which combines a traditional boxplot, scatter points, and a density shape, giving a complete picture of the distribution of values. The code below is adapted from Cedric Scherer’s work.
ggplot( # Setup ggplot object
px_pivot,
aes(x = "1", y = value)
) +
ggdist::stat_halfeye( # Distribution plot
adjust = 0.5,
width = 0.6,
.width = 0,
justification = -0.2,
point_color = NA
) +
geom_boxplot( # Boxplot
width = 0.15,
outlier.shape = NA
) +
gghalves::geom_half_point( # Individual observations
side = "l",
range_scale = 0.4,
alpha = 0.2
) +
facet_wrap(~measure, nrow = 1) + # Generate two plots
theme_minimal() + # Theme styling
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
strip.background = element_rect(fill = "#BE0000", color = NA),
strip.text = element_text(color = "white", face = "bold")
)
Clearly the data are highly skewed. This opens up the possibility of a logarithmic transformation. Let’s run the same code using a log scale:
ggplot( # Setup ggplot object
px_pivot,
aes(x = "1", y = value)
) +
ggdist::stat_halfeye( # Distribution plot
adjust = 0.5,
width = 0.6,
.width = 0,
justification = -0.2,
point_color = NA
) +
geom_boxplot( # Boxplot
width = 0.15,
outlier.shape = NA
) +
gghalves::geom_half_point( # Individual observations
side = "l",
range_scale = 0.4,
alpha = 0.2
) +
scale_y_log10() +
facet_wrap(~measure, nrow = 1) + # Generate two plots
theme_minimal() + # Theme styling
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
strip.background = element_rect(fill = "#BE0000", color = NA),
strip.text = element_text(color = "white", face = "bold")
)
That’s much improved! Such a transform may be a great way to improve our eventual forecasting model and better align with the assumptions of ARIMA.
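The visual improvement can be quantified. Below is a minimal sketch on simulated right-skewed data (not the assignment data), using a hand-rolled skewness function rather than a package helper, showing how the log transform pulls sample skewness toward zero:

```r
# Sample skewness: the third standardized moment (positive = long right tail)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(42)
x <- rlnorm(1000, meanlog = 10, sdlog = 1) # right-skewed, like Scripts/Cost

skewness(x)      # strongly positive
skewness(log(x)) # near zero: roughly symmetric after the log
```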
Trends
This is a forecasting exercise. As such, we’d be remiss not to plot the trend. Let’s first fully aggregate the data and then perform another pivot to a long format:
px_trend <-
px_data |>
as_tibble() |>
summarise(
Total_Scripts = sum(Scripts),
Total_Cost = sum(Cost),
.by = Month
) |>
pivot_longer(
cols = -Month,
names_to = "measure",
values_to = "values"
)
head(px_trend)
# A tibble: 6 × 3
Month measure values
<mth> <chr> <dbl>
1 1991 Jul Total_Scripts 8090395
2 1991 Jul Total_Cost 90494401
3 1991 Aug Total_Scripts 7308118
4 1991 Aug Total_Cost 83068686
5 1991 Sep Total_Scripts 7614644
6 1991 Sep Total_Cost 87399990
Let’s now plot the trend for both measures. We’ll apply a logarithmic transform right away since we know both measures are highly skewed.
ggplot(
px_trend,
aes(Month, values)
) +
geom_line() +
scale_y_log10() +
facet_wrap(~measure, ncol = 1, scales = "free") + # Generate two plots
theme_minimal() + # Theme styling
theme(
axis.title.y = element_blank(),
axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
strip.background = element_rect(fill = "#BE0000", color = NA),
strip.text = element_text(color = "white", face = "bold")
)
We’ve got some nice plots now. Clearly the values are not stationary, but that’s a matter for the next section as we prepare the data for forecasting.
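It can also help to separate the trend and seasonal components explicitly before modeling. Here is a sketch using stats::stl() on the built-in log(AirPassengers) series as a stand-in for our log-transformed totals (our own values would first need to become a ts, e.g. ts(values, frequency = 12)):

```r
# STL decomposition: data = seasonal + trend + remainder;
# log(AirPassengers) stands in for our log-transformed monthly totals
decomp <- stl(log(AirPassengers), s.window = "periodic")

plot(decomp) # four panels: data, seasonal, trend, remainder
```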
Preprocessing
Data cleaning / Feature Engineering
As we’ve already assessed, there’s no “cleaning” to perform on the data since everything we’re going to use is complete. However, we do want to do some feature engineering. We’re going to tackle a handful of issues/checks:
- We need to formally aggregate the data so we don’t have multiple levels per point in time
- We need to resolve the skewness; we’ve already shown how to do that with a logarithmic transform
- We need to assess stationarity; the “I” (integrated) component of ARIMA differences the series to remove trend
- We also need to assess “auto-correlation”, so the model knows how much error to use in the moving average and how many lagged points to include
Aggregation
px_agg <-
px_data |>
as_tibble() |>
summarise(
Total_Scripts = sum(Scripts),
Total_Cost = sum(Cost),
.by = Month
) |>
mutate(Month = yearmonth(Month)) |>
tsibble(index = Month)
head(px_agg)
# A tsibble: 6 x 3 [1M]
Month Total_Scripts Total_Cost
<mth> <dbl> <dbl>
1 1991 Jul 8090395 90494401
2 1991 Aug 7308118 83068686
3 1991 Sep 7614644 87399990
4 1991 Oct 8206248 96884880
5 1991 Nov 7686634 93892910
6 1991 Dec 8954872 112612920
Resolve skewness
px_norm <-
px_agg |>
mutate(across(
Total_Scripts:Total_Cost,
~log(.),
.names = "{.col}_log"
))
head(px_norm)
# A tsibble: 6 x 5 [1M]
Month Total_Scripts Total_Cost Total_Scripts_log Total_Cost_log
<mth> <dbl> <dbl> <dbl> <dbl>
1 1991 Jul 8090395 90494401 15.9 18.3
2 1991 Aug 7308118 83068686 15.8 18.2
3 1991 Sep 7614644 87399990 15.8 18.3
4 1991 Oct 8206248 96884880 15.9 18.4
5 1991 Nov 7686634 93892910 15.9 18.4
6 1991 Dec 8954872 112612920 16.0 18.5
Stationarity
We confirm stationarity using the “Augmented Dickey-Fuller Test”. We’ll do it for each of our logarithmically transformed variables, Total_Scripts_log and Total_Cost_log:
adf.test(px_norm$Total_Scripts_log, alternative = "stationary")
Augmented Dickey-Fuller Test
data: px_norm$Total_Scripts_log
Dickey-Fuller = -5.5837, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
adf.test(px_norm$Total_Cost_log, alternative = "stationary")
Augmented Dickey-Fuller Test
data: px_norm$Total_Cost_log
Dickey-Fuller = -3.5836, Lag order = 5, p-value = 0.03617
alternative hypothesis: stationary
Are the data stationary? Well, we set the alternative hypothesis to be “stationary”, therefore a statistically significant result from the test (very low p-value) would indicate they are. And that’s what we’re getting.
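To build intuition for reading these p-values, here is a small sketch on simulated data (not the assignment series) contrasting a stationary white-noise series with a random walk, which has a unit root:

```r
library(tseries)

set.seed(1)
stationary_x <- rnorm(200)         # white noise: stationary by construction
random_walk  <- cumsum(rnorm(200)) # unit root: non-stationary

adf.test(stationary_x, alternative = "stationary") # small p-value: reject the unit root
adf.test(random_walk, alternative = "stationary")  # typically a large p-value: cannot reject
```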
However, we can see some trend in the data. Let’s difference the values and see the results.
adf.test(diff(px_norm$Total_Scripts_log), alternative = "stationary")
Augmented Dickey-Fuller Test
data: diff(px_norm$Total_Scripts_log)
Dickey-Fuller = -8.8806, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
adf.test(diff(px_norm$Total_Cost_log), alternative = "stationary")
Augmented Dickey-Fuller Test
data: diff(px_norm$Total_Cost_log)
Dickey-Fuller = -8.9576, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
There’s “some” improvement in Total_Cost_log, but it’s marginal. Let’s plot it quickly:
ggplot(
px_norm |>
mutate(across(
c(Total_Cost_log, Total_Scripts_log),
~difference(.)
)) |>
as_tibble() |>
pivot_longer(
cols = c(Total_Scripts_log, Total_Cost_log),
names_to = "measure", values_to = "values"
),
aes(Month, values)
) +
geom_line() +
facet_wrap(~measure, ncol = 1, scales = "free") + # Generate two plots
theme_minimal() + # Theme styling
theme(
axis.title.y = element_blank(),
axis.title.x = element_blank(),
panel.grid.major.x = element_blank(),
strip.background = element_rect(fill = "#BE0000", color = NA),
strip.text = element_text(color = "white", face = "bold")
)
That looks A LOT more stationary. We would say that \(d=1\) for each value is sufficient in an ARIMA model.
px_diff <-
px_norm |>
mutate(
Total_Scripts_log_diff = difference(Total_Scripts_log),
Total_Cost_log_diff = difference(Total_Cost_log)
)
head(px_diff)
# A tsibble: 6 x 7 [1M]
Month Total_Scripts Total_Cost Total_Scripts_log Total_Cost_log
<mth> <dbl> <dbl> <dbl> <dbl>
1 1991 Jul 8090395 90494401 15.9 18.3
2 1991 Aug 7308118 83068686 15.8 18.2
3 1991 Sep 7614644 87399990 15.8 18.3
4 1991 Oct 8206248 96884880 15.9 18.4
5 1991 Nov 7686634 93892910 15.9 18.4
6 1991 Dec 8954872 112612920 16.0 18.5
# ℹ 2 more variables: Total_Scripts_log_diff <dbl>, Total_Cost_log_diff <dbl>
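A side note on interpreting the differenced log series: since diff(log(x)) equals log(x_t / x_{t-1}), exponentiating it recovers the exact month-over-month ratio, so the differenced values approximate percentage growth. A tiny illustration with made-up numbers:

```r
x <- c(100, 105, 99) # hypothetical monthly totals

d <- diff(log(x))

exp(d)                # exact period-over-period ratios: 1.05, then 99/105
x[-1] / x[-length(x)] # same ratios computed directly
```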
Auto-correlation
Finally, let’s resolve any auto-correlation issues. We can assess this in two ways:
acf(px_diff$Total_Scripts_log_diff[-1])
acf(px_diff$Total_Cost_log_diff[-1])
In these plots, we’re seeing 1 significant correlation (the second bar from the left). For an ARIMA model, we’ll go with \(q=1\).
Next, let’s look at partial auto-correlation to determine \(p\):
pacf(px_diff$Total_Scripts_log_diff[-1])
pacf(px_diff$Total_Cost_log_diff[-1])
These plots show three (3) significant correlations, so we would use \(p=3\) in an ARIMA model.
Of course, these values are slightly fluid. We could run multiple models testing different values and comparing the resulting AIC. The model with the lowest AIC is the best performing.
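That comparison can be scripted. Here is a sketch of a small grid search over \(p\) and \(q\) at \(d=1\) using forecast::Arima(), run on the built-in log(AirPassengers) series as a stand-in for our log-transformed totals:

```r
library(forecast)

y <- log(AirPassengers) # stand-in for a log-transformed monthly series

# Fit every (p, q) combination at d = 1 and record each model's AIC;
# fits that fail to converge are recorded as NA
grid <- expand.grid(p = 0:3, q = 0:3)
grid$aic <- mapply(function(p, q) {
  tryCatch(AIC(Arima(y, order = c(p, 1, q))), error = function(e) NA)
}, grid$p, grid$q)

grid[which.min(grid$aic), ] # lowest AIC = preferred candidate
```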
Forecasting model
Auto-ARIMA
For this exercise, we’re going to let auto.arima find the best model. It will go through the same process we just did, finding optimal values of \(p, d, q\).
px_Scripts_fit <- auto.arima(px_norm$Total_Scripts_log)
px_Cost_fit <- auto.arima(px_norm$Total_Cost_log)
Evaluation
Let’s see how the models performed, starting with the forecast for Scripts. We can use the checkresiduals() function:
checkresiduals(px_Scripts_fit)
Ljung-Box test
data: Residuals from ARIMA(2,1,2) with drift
Q* = 9.3938, df = 6, p-value = 0.1526
Model df: 4. Total lags used: 10
Interestingly, the model chose an ARIMA(2,1,2) with drift: \(d=1\) (matching our analysis), \(p=2\) (we estimated \(p=3\)), and \(q=2\) (we estimated \(q=1\)). Looking at the plots, the model is only fair.
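The Ljung-Box statistic that checkresiduals() reports can also be reproduced directly with stats::Box.test(). A sketch on a stand-in fit (log(AirPassengers), not the assignment series), with fitdf set to the total ARMA order so the degrees of freedom line up:

```r
library(forecast)

fit <- auto.arima(log(AirPassengers)) # stand-in fit, not the assignment model

# fit$arma holds c(p, q, P, Q, period, d, D); fitdf = total ARMA coefficients,
# mirroring the "Model df" that checkresiduals() reports
k <- sum(fit$arma[1:4])
Box.test(residuals(fit), lag = 24, type = "Ljung-Box", fitdf = k)
```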
We can take this model and plot the forecast:
fitted_mod <- forecast(px_Scripts_fit, h=12, level = c(80))
autoplot(fitted_mod) #+ #scale_y_continuous(trans = scales::transform_exp())
Remember, those y-axis values for total number of scripts are on a logarithmic scale. We can extract the forecasted values from the model and transform them back to a linear scale:
tibble(
lower = exp(fitted_mod$lower),
mean = exp(fitted_mod$mean),
upper = exp(fitted_mod$upper)
)
# A tibble: 12 × 3
lower[,"80%"] mean upper[,"80%"]
<ts> <dbl> <ts>
1 12447660 14186143. 16167428
2 13112376 14997233. 17153031
3 13205892 15106715. 17281138
4 13056650 14945642. 17107928
5 12876344 14742098. 16878194
6 12739058 14585073. 16698594
7 12654614 14493220. 16598959
8 12612583 14457097. 16571359
9 12599541 14460088. 16595379
10 12604312 14487084. 16651095
11 12618968 14527082. 16723721
12 12638468 14573130. 16803945
We can interpret this as total scripts during the first forecasted month will fall between 12.4M and 16.2M with 80% certainty.
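One caveat worth noting as a possible refinement (not something done above): exponentiating a log-scale forecast returns the median of the implied lognormal distribution, not its mean, since the mean of a lognormal is exp(mu + sigma^2/2). The sketch below applies that adjustment using the one-step residual variance as a simplification (the true forecast variance grows with horizon), again on log(AirPassengers) as a stand-in. The forecast package can also handle this natively via forecast(..., lambda = 0, biasadj = TRUE).

```r
library(forecast)

fit <- auto.arima(log(AirPassengers)) # stand-in log-scale model
fc <- forecast(fit, h = 12, level = 80)

s2 <- fit$sigma2 # one-step residual variance on the log scale

naive    <- exp(fc$mean)          # back-transformed median
adjusted <- exp(fc$mean + s2 / 2) # approximate lognormal mean

head(cbind(naive, adjusted)) # adjusted values sit slightly above the naive ones
```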