ARIMA, Time Series Forecasting | Medicare Prescription Volumes

Forecasting the monthly volume of Medicare prescription dispenses using the ARIMA model
Author

Adam Bushman

Published

October 15, 2024

Assignment Questions

Name

What is your name? Include all team members if submitting as a group.

Adam Bushman [u6049169]; no group members.

Perspective

From what perspective are you conducting the analysis? (Who are you? / Who are you working for?)

I am a Medicare prescription plan manager, in charge of monitoring script dispenses across Medicare plans. It's important for me to understand the anticipated scale (in volume and cost) of scripts for the next calendar year.

Question

What is your question?

What is the range of likely scenarios for monthly Medicare prescription dispenses over the next 12 months?

Data set

Describe your dataset(s) including URL (if available)

The data set is titled Monthly Medicare Australia prescription data, sourced from the {tsibbledata} package. It was originally sourced from Medicare Australia.

Predictor(s) and target(s)

What is (are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric).

The independent variable is Month; other covariates are excluded for simplicity. It is of the yearmonth data type, representing the unique month and year of each observation.

The dependent variables are Scripts and Cost, each forecasted independently. Both are of type double (numeric).

Additional detail may be seen in the data dictionary.

Model reasonability

How are your variables suitable for your analysis method?

The data are already formatted for use in time series analysis and forecasting. A tsibble is a time series data structure (akin to a ts object) that adheres to “tidy” principles. Additionally, the variables are well suited to ARIMA since we have consistent time intervals and continuous target variables. Furthermore, the data exhibit characteristics ARIMA is designed to handle, such as non-stationarity and auto-correlation.

Conclusions

What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?

While exploring the monthly Scripts and Cost data, we found that the series exhibited seasonality and non-stationarity, elements ARIMA is well suited to handle. After transforming the data to satisfy the assumptions of the forecasting model, we proceeded to fit it.

The auto.arima modeling approach yielded the following parameters:

  • \(p = 2\): two lagged values were leveraged
  • \(d = 1\): the values were “differenced” once to achieve stationarity
  • \(q = 2\): the errors from the previous two values were included in the moving-average term

The resulting model yielded forecasted values for the next twelve months, with an 80% prediction interval. The volume of scripts in the first forecasted month is anticipated to fall between 12.4M and 16.2M with 80% certainty (see the back-transformed forecast table at the end of the workflow).

Assumptions

What are your assumptions and limitations? Did you include any robustness checks?

A big limitation in the data is the non-normal distribution of Cost and Scripts. Logarithmic transforms were extremely helpful in resolving the issue but complicated interpretation of the model and forecasted values throughout. I'd be interested to learn more about how this is handled in industry.

Additionally, the data were presented at disaggregated levels, grouped by the covariates Concession, Type, and ATC1/ATC2. It's likely a Medicare prescription program manager would be more interested in the varying forecasts that make up these slices. Due to a knowledge gap in this area, this was resolved via aggregation.

It was also assumed that auto.arima would choose an optimal model. However, there was some suggestive evidence that different parameter values could have produced a similarly performant but more robust result. Ultimately, I ceded model choice to the algorithm; I could have fit both models and compared their AICs.

Assignment Workflow

Analysis Prep

Loading packages

library('tidyverse')        # Wrapper for many convenient libraries
library('tsibble')          # Working with time series tibbles
library('tsibbledata')      # Contains data for the assignment
library('skimr')            # Quickly summarise data
library('gt')               # Render "great tables"

library('forecast')         # ARIMA fitting, forecasting, and residual diagnostics
library('tseries')          # Augmented Dickey-Fuller test (adf.test)
library('astsa')            # Applied statistical time series analysis utilities

Loading the data

We’ll begin by referencing the data from {tsibbledata}; the “PBS” data set is the collection of variables pertaining to Australian Medicare prescriptions.

px_data <- tsibbledata::PBS          # Load data into the session

Let’s now get a sense for what we’re working with.

Data set inspection

Right away, I like to get acquainted with the data set. That means understanding what each column seeks to describe, confirming the granularity of the rows, and getting my arms around structure, completeness, distribution, etc.

The package designers for {tsibbledata} documented the variables, summarized in the following data dictionary:

gt(px_data_dict) %>%                                   # Create a "great tables" (gt) object
    cols_label(                                         # Rename some columns
        variable = "Variable Name", 
        datatype = "Data Type", 
        description = "Variable Description"
    ) %>%
    tab_options(
        column_labels.background.color = "#d9230f",     # Apply red header background
        column_labels.font.weight = "bold",             # Bold headers
        row.striping.background_color = '#FFFFFF'   # Remove row striping
    )
Variable Name Data Type Variable Description
Month <yearmonth> A key for the year/month of the observation
Concession <char> Concessional scripts are given to pensioners, unemployed, dependents, and other card holders
Type <char> Co-payments are made until an individual's script expenditure hits a threshold ($290.00 for concession, $1,141.80 otherwise). Safety net subsidies are provided to individuals exceeding this amount
ATC1 <char> Code for Anatomical Therapeutic Chemical index (level 1)
ATC1_desc <char> Description for Anatomical Therapeutic Chemical index (level 1); divides drugs into 14 anatomical groups based on the primary part of the body or organ system they act upon
ATC2 <char> Code for Anatomical Therapeutic Chemical index (level 2)
ATC2_desc <char> Code for Anatomical Therapeutic Chemical index (level 2); specifies the type of treatment or function the drugs perform within the organ system
Scripts <double> Total number of scripts
Cost <double> Cost of the scripts in $AUD

Using the {skimr} package, we can get a comprehensive summary of the data.

skim(px_data[,-1])          # Left out <yearmonth> data type
Data summary
Name px_data[, -1]
Number of rows 67596
Number of columns 8
_______________________
Column type frequency:
character 6
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Concession 0 1.00 7 12 0 2 0
Type 0 1.00 10 11 0 2 0
ATC1 0 1.00 1 1 0 15 0
ATC1_desc 3193 0.95 7 63 0 14 0
ATC2 0 1.00 1 3 0 84 0
ATC2_desc 2377 0.96 1 63 0 84 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Scripts 0 1 35096.17 106803.3 0 68.00 1225.50 13506.75 1607119 ▇▁▁▁▁
Cost 0 1 868677.62 2987324.8 0 1743.93 33396.57 348131.28 72911261 ▇▁▁▁▁


The data are mostly complete, thankfully. We see about 5% of records with missing ATC1_desc and ATC2_desc. These values aren’t strictly necessary for a simple forecasting exercise anyway.

I always like to see the first few rows of the data using head():

head(px_data) |>
    mutate(Month = yearmonth(Month))
# A tsibble: 6 x 9 [1M]
# Key:       Concession, Type, ATC1, ATC2 [1]
     Month Concession   Type       ATC1  ATC1_desc ATC2  ATC2_desc Scripts  Cost
     <mth> <chr>        <chr>      <chr> <chr>     <chr> <chr>       <dbl> <dbl>
1 1991 Jul Concessional Co-paymen… A     Alimenta… A01   STOMATOL…   18228 67877
2 1991 Aug Concessional Co-paymen… A     Alimenta… A01   STOMATOL…   15327 57011
3 1991 Sep Concessional Co-paymen… A     Alimenta… A01   STOMATOL…   14775 55020
4 1991 Oct Concessional Co-paymen… A     Alimenta… A01   STOMATOL…   15380 57222
5 1991 Nov Concessional Co-paymen… A     Alimenta… A01   STOMATOL…   14371 52120
6 1991 Dec Concessional Co-paymen… A     Alimenta… A01   STOMATOL…   15028 54299

Simple Exploratory Data Analysis

Tracking levels

It’s clear that the data are disaggregated: the Scripts counts and Cost totals are split across the supplemental grouping variables. Let’s get a sense for what that looks like.

Concession

px_data %>%
    count(.by = Concession) %>%
    mutate(share = n / sum(n))
# A tibble: 2 × 3
  .by              n share
  <chr>        <int> <dbl>
1 Concessional 34020 0.503
2 General      33576 0.497

Type

px_data %>%
    count(.by = Type) %>%
    mutate(share = n / sum(n))
# A tibble: 2 × 3
  .by             n share
  <chr>       <int> <dbl>
1 Co-payments 33612 0.497
2 Safety net  33984 0.503

ATC1

px_data %>%
    count(.by = ATC1) %>%
    mutate(share = n / sum(n))
# A tibble: 15 × 3
   .by       n  share
   <chr> <int>  <dbl>
 1 A     10140 0.150 
 2 B      3264 0.0483
 3 C      7344 0.109 
 4 D      8100 0.120 
 5 G      3264 0.0483
 6 H      4080 0.0604
 7 J      4848 0.0717
 8 L      3216 0.0476
 9 M      4080 0.0604
10 N      4896 0.0724
11 P      2412 0.0357
12 R      3936 0.0582
13 S      3120 0.0462
14 V      4080 0.0604
15 Z       816 0.0121

Conclusion

This forecasting exercise is meant to be simple; we aren’t generating multiple, grouped forecasts. We’ll assume a single slice or the fully aggregated data set is of interest to the user.
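If a particular slice were of interest instead, it could be isolated with a simple filter before any aggregation. A minimal sketch, assuming the concessional co-payment slice (levels taken from the counts above) is the one of interest:

px_slice <- px_data |>
    filter(Concession == "Concessional", Type == "Co-payments")    # Keep a single slice of the disaggregated data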

Assessing distribution

The variables available to us for forecasting are Scripts and Cost. Let’s get a sense for their distribution.

First, to make plotting easier, let’s pivot the data to a long format:

px_pivot <- px_data |> 
    as_tibble() |>
    select(Scripts, Cost) |>        # Restrict data to variables of interest        
    pivot_longer(                   # Pivot to a long version
        cols = everything(),                        
        names_to = "measure", 
        values_to = "value"
    )

head(px_pivot)
# A tibble: 6 × 2
  measure value
  <chr>   <dbl>
1 Scripts 18228
2 Cost    67877
3 Scripts 15327
4 Cost    57011
5 Scripts 14775
6 Cost    55020

A great way to visualize distribution is with a “raincloud” plot, which combines a traditional boxplot, scattered points, and a density shape, giving a complete picture of the distribution of values. The code below is adapted from Cédric Scherer’s work.

ggplot(                                                                         # Setup ggplot object
    px_pivot, 
    aes(x = "1", y = value)
) +                                    
    ggdist::stat_halfeye(                                                       # Distribution plot
        adjust = 0.5, 
        width = 0.6, 
        .width = 0, 
        justification = -0.2, 
        point_color = NA
    ) +
    geom_boxplot(                                                               # Boxplot
        width = 0.15, 
        outlier.shape = NA
    ) +
    gghalves::geom_half_point(                                                  # Individual observations
        side = "l", 
        range_scale = 0.4, 
        alpha = 0.2
    ) + 
    facet_wrap(~measure, nrow = 1) +                                            # Generate two plots
    theme_minimal() +                                                           # Theme styling
    theme(
        axis.title.x = element_blank(), 
        axis.title.y = element_blank(), 
        axis.text.x = element_blank(), 
        panel.grid.major.x = element_blank(), 
        strip.background = element_rect(fill = "#BE0000", color = NA), 
        strip.text = element_text(color = "white", face = "bold")
    )

Clearly the data are highly skewed. This opens up the possibility of a logarithmic transformation. Let’s run the same code using a log scale:

ggplot(                                                                         # Setup ggplot object
    px_pivot, 
    aes(x = "1", y = value)
) +                                    
    ggdist::stat_halfeye(                                                       # Distribution plot
        adjust = 0.5, 
        width = 0.6, 
        .width = 0, 
        justification = -0.2, 
        point_color = NA
    ) +
    geom_boxplot(                                                               # Boxplot
        width = 0.15, 
        outlier.shape = NA
    ) +
    gghalves::geom_half_point(                                                  # Individual observations
        side = "l", 
        range_scale = 0.4, 
        alpha = 0.2
    ) + 
    scale_y_log10() +  
    facet_wrap(~measure, nrow = 1) +                                            # Generate two plots
    theme_minimal() +                                                           # Theme styling
    theme(
        axis.title.x = element_blank(), 
        axis.title.y = element_blank(), 
        axis.text.x = element_blank(), 
        panel.grid.major.x = element_blank(), 
        strip.background = element_rect(fill = "#BE0000", color = NA), 
        strip.text = element_text(color = "white", face = "bold")
    )

That’s much improved! Such a transform may be a great way to improve our eventual forecasting model and better align with the assumptions of ARIMA.

Preprocessing

Data cleaning / Feature Engineering

As we’ve already assessed, there’s no “cleaning” to perform on the data since everything we’re going to use is complete. However, we do want to do some feature engineering. We’re going to tackle a handful of issues/checks:

  1. We need to formally aggregate the data so we don’t have multiple levels per point in time
  2. We need to resolve the skewness; we’ve already shown how to do that with a logarithmic transform
  3. We need to assess stationarity; the series fed to the AR/MA components (after differencing) should have no trend
  4. We also need to assess auto-correlation, so we know how many error terms to include in the moving-average component (\(q\)) and how many lagged values to include in the autoregressive component (\(p\))

Aggregation

px_agg <-
    px_data |>
    as_tibble() |>
    summarise(
        Total_Scripts = sum(Scripts), 
        Total_Cost = sum(Cost), 
        .by = Month
    ) |>
    mutate(Month = yearmonth(Month)) |>
    tsibble(index = Month)

head(px_agg)
# A tsibble: 6 x 3 [1M]
     Month Total_Scripts Total_Cost
     <mth>         <dbl>      <dbl>
1 1991 Jul       8090395   90494401
2 1991 Aug       7308118   83068686
3 1991 Sep       7614644   87399990
4 1991 Oct       8206248   96884880
5 1991 Nov       7686634   93892910
6 1991 Dec       8954872  112612920

Resolve skewness

px_norm <-
    px_agg |>
    mutate(across(
        Total_Scripts:Total_Cost, 
        ~log(.), 
        .names = "{.col}_log"
    ))

head(px_norm)
# A tsibble: 6 x 5 [1M]
     Month Total_Scripts Total_Cost Total_Scripts_log Total_Cost_log
     <mth>         <dbl>      <dbl>             <dbl>          <dbl>
1 1991 Jul       8090395   90494401              15.9           18.3
2 1991 Aug       7308118   83068686              15.8           18.2
3 1991 Sep       7614644   87399990              15.8           18.3
4 1991 Oct       8206248   96884880              15.9           18.4
5 1991 Nov       7686634   93892910              15.9           18.4
6 1991 Dec       8954872  112612920              16.0           18.5

Stationarity

We confirm stationarity using the “Augmented Dickey-Fuller Test”. We’ll do it for each of our logarithmically transformed variables, Total_Scripts_log and Total_Cost_log:

adf.test(px_norm$Total_Scripts_log, alternative = "stationary")

    Augmented Dickey-Fuller Test

data:  px_norm$Total_Scripts_log
Dickey-Fuller = -5.5837, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
adf.test(px_norm$Total_Cost_log, alternative = "stationary")

    Augmented Dickey-Fuller Test

data:  px_norm$Total_Cost_log
Dickey-Fuller = -3.5836, Lag order = 5, p-value = 0.03617
alternative hypothesis: stationary

Are the data stationary? Well, we set the alternative hypothesis to be “stationary”, therefore a statistically significant result from the test (very low p-value) would indicate they are. And that’s what we’re getting.

However, we can see some trend in the data. Let’s difference the values and see the results.

adf.test(diff(px_norm$Total_Scripts_log), alternative = "stationary")

    Augmented Dickey-Fuller Test

data:  diff(px_norm$Total_Scripts_log)
Dickey-Fuller = -8.8806, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
adf.test(diff(px_norm$Total_Cost_log), alternative = "stationary")

    Augmented Dickey-Fuller Test

data:  diff(px_norm$Total_Cost_log)
Dickey-Fuller = -8.9576, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary

There’s some improvement in Total_Cost_log, but it’s marginal. What if we plot the differenced series quickly:

ggplot(
    px_norm |>
    mutate(across(
        c(Total_Cost_log, Total_Scripts_log), 
        ~difference(.)
    )) |>
        as_tibble() |>
        pivot_longer(
            cols = c(Total_Scripts_log, Total_Cost_log), 
            names_to = "measure", values_to = "values"
        ), 
    aes(Month, values)
) +
    geom_line() +
    facet_wrap(~measure, ncol = 1, scales = "free") +                           # Generate two plots
    theme_minimal() +                                                           # Theme styling
    theme(
        axis.title.y = element_blank(), 
        axis.title.x = element_blank(), 
        panel.grid.major.x = element_blank(), 
        strip.background = element_rect(fill = "#BE0000", color = NA), 
        strip.text = element_text(color = "white", face = "bold")
    )

That looks A LOT more stationary. We would say that \(d=1\) for each value is sufficient in an ARIMA model.
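As a quick cross-check on the degree of differencing, the {forecast} package offers ndiffs(), which estimates how many differences a series needs to become stationary. A minimal sketch; if these return 1, that corroborates our choice of \(d = 1\):

forecast::ndiffs(px_norm$Total_Scripts_log)    # Estimated differences needed for the log scripts series
forecast::ndiffs(px_norm$Total_Cost_log)       # Estimated differences needed for the log cost series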

px_diff <- 
    px_norm |>
    mutate(
        Total_Scripts_log_diff = difference(Total_Scripts_log), 
        Total_Cost_log_diff = difference(Total_Cost_log)
    )

head(px_diff)
# A tsibble: 6 x 7 [1M]
     Month Total_Scripts Total_Cost Total_Scripts_log Total_Cost_log
     <mth>         <dbl>      <dbl>             <dbl>          <dbl>
1 1991 Jul       8090395   90494401              15.9           18.3
2 1991 Aug       7308118   83068686              15.8           18.2
3 1991 Sep       7614644   87399990              15.8           18.3
4 1991 Oct       8206248   96884880              15.9           18.4
5 1991 Nov       7686634   93892910              15.9           18.4
6 1991 Dec       8954872  112612920              16.0           18.5
# ℹ 2 more variables: Total_Scripts_log_diff <dbl>, Total_Cost_log_diff <dbl>

Auto-correlation

Finally, let’s assess auto-correlation. We can do this in two ways:

acf(px_diff$Total_Scripts_log_diff[-1])

acf(px_diff$Total_Cost_log_diff[-1])

In these plots, we’re seeing one significant correlation beyond lag 0 (the second bar from the left; the first bar is lag 0, which always equals 1). For an ARIMA model, we’ll go with \(q=1\).

Next, let’s look at partial auto-correlation to determine \(p\):

pacf(px_diff$Total_Scripts_log_diff[-1])

pacf(px_diff$Total_Cost_log_diff[-1])

These plots show three (3) significant correlations so we would use \(p=3\) in an ARIMA model.

Of course, these values are somewhat fluid. We could run multiple models with different values and compare the resulting AIC; the model with the lowest AIC is the best performing, as in the sketch below.
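For instance, a minimal sketch of that comparison, pitting the orders suggested by our ACF/PACF reading (\(p = 3, d = 1, q = 1\)) against whatever auto.arima selects (the manual orders are illustrative, not a recommendation):

manual_fit <- forecast::Arima(px_norm$Total_Scripts_log, order = c(3, 1, 1))   # Orders from our ACF/PACF analysis
auto_fit   <- forecast::auto.arima(px_norm$Total_Scripts_log)                  # Orders chosen by the algorithm
AIC(manual_fit, auto_fit)                                                      # Lower AIC = better-performing candidate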

Forecasting model

Auto-ARIMA

For this exercise, we’re going to let auto.arima find the best model. It will go through a similar process to the one we just did, finding optimal values of \(p, d, q\).

px_Scripts_fit <- auto.arima(px_norm$Total_Scripts_log)
px_Cost_fit <- auto.arima(px_norm$Total_Cost_log)
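Before evaluating, we can peek at what the algorithm selected. A minimal sketch using helpers from {forecast} (output omitted here):

px_Scripts_fit                          # Prints the selected model, e.g. "ARIMA(p,d,q) with drift"
forecast::arimaorder(px_Scripts_fit)    # Extracts the chosen (p, d, q) as a named vector
forecast::arimaorder(px_Cost_fit)       # Same for the Cost model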

Evaluation

Let’s see how the models performed, starting with the forecast for Scripts. We can use the checkresiduals() function:

checkresiduals(px_Scripts_fit)


    Ljung-Box test

data:  Residuals from ARIMA(2,1,2) with drift
Q* = 9.3938, df = 6, p-value = 0.1526

Model df: 4.   Total lags used: 10

Interestingly, the model chose ARIMA(2,1,2) with drift: \(d=1\) (matching our differencing assessment), \(q=2\) (we estimated \(q=1\)), and \(p=2\) (we estimated \(p=3\)). Looking at the residual plots, the model is only fair, though the Ljung-Box test (p = 0.15) suggests the residuals are not significantly autocorrelated.

We can take this model and plot the forecast:

fitted_mod <- forecast(px_Scripts_fit, h=12, level = c(80))

autoplot(fitted_mod) #+

    #scale_y_continuous(trans = scales::transform_exp())

Remember, those y-axis values for total number of scripts are on a logarithmic scale. We can extract the forecasted values from the model and transform them back to a linear scale:

tibble(
    lower = exp(fitted_mod$lower), 
    mean = exp(fitted_mod$mean), 
    upper = exp(fitted_mod$upper)
)
# A tibble: 12 × 3
   lower[,"80%"]      mean upper[,"80%"]
   <ts>              <dbl> <ts>         
 1 12447660      14186143. 16167428     
 2 13112376      14997233. 17153031     
 3 13205892      15106715. 17281138     
 4 13056650      14945642. 17107928     
 5 12876344      14742098. 16878194     
 6 12739058      14585073. 16698594     
 7 12654614      14493220. 16598959     
 8 12612583      14457097. 16571359     
 9 12599541      14460088. 16595379     
10 12604312      14487084. 16651095     
11 12618968      14527082. 16723721     
12 12638468      14573130. 16803945     

We can interpret this as: total scripts during the first forecasted month are expected to fall between roughly 12.4M and 16.2M with 80% certainty.
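The same workflow could, in principle, be repeated for the Cost model fit earlier; a minimal sketch, mirroring the diagnostics and back-transform steps used for Scripts above:

checkresiduals(px_Cost_fit)                              # Residual diagnostics for the Cost model
cost_fc <- forecast(px_Cost_fit, h = 12, level = c(80))  # 12-month forecast with 80% interval

tibble(                                                  # Back-transform the log-scale forecast to $AUD
    lower = exp(cost_fc$lower), 
    mean = exp(cost_fc$mean), 
    upper = exp(cost_fc$upper)
)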