Market basket analysis: Example using grocery data

Idea

Our market basket analysis is based on the purchase data collected from one month of operation at a real-world grocery store. The data contains 9,835 transactions or about 327 transactions per day (roughly 30 transactions per hour in a 12-hour business day), suggesting that the retailer is not particularly large, nor is it particularly small.

Given the moderate size of the retailer, we will assume that they are not terribly concerned with finding rules that apply only to a specific brand of milk or detergent. With this in mind, all brand names can be removed from the purchases. This reduces the number of groceries to a more manageable 169 types, using broad categories such as chicken, frozen meals, margarine, and soda.

General considerations

The analysis has two main steps:

  • Obtain the data and create the model
  • Analyze the model

Technical implementation

Get the data & create the model

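library(arules)

## Step 1. Get data
# Each line of the file is one transaction; items are separated by commas
groceries <- read.transactions("groceries.csv", sep = ",")

## Step 2. Exploring the data
summary(groceries)                 # number of transactions and items, most frequent items
inspect(groceries[1:3])            # items in the first three transactions
itemFrequency(groceries[, 1:3])    # support of the first three items

## Step 3. Creating the model
# Keep rules with support >= 0.6%, confidence >= 25%, and at least two items
groceryrules <- apriori(groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))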
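To get a feel for the support threshold used in Step 3, the following sketch converts it into a number of transactions. It uses only the 9,835 transactions mentioned earlier; the variable names are illustrative and not part of arules, and the 30-day month is an assumption.

```r
# Illustrative arithmetic only; these names are not part of arules
n_transactions <- 9835    # transactions in one month of data
min_support    <- 0.006   # support threshold passed to apriori()

# Smallest whole number of transactions that reaches the threshold
min_count <- ceiling(min_support * n_transactions)
min_count

# Assuming a 30-day month, the same figure per day
min_count / 30
```

In other words, a combination of items has to show up in about 60 of the month's baskets, or roughly twice a day, before a rule about it is considered.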

Analyze the model

## Step 4. Inspecting the model
inspect(groceryrules[1:3])
#     lhs             rhs               support     confidence lift     count
# [1] {pot plants} => {whole milk}      0.006914082 0.4000000  1.565460 68   
# [2] {pasta}      => {whole milk}      0.006100661 0.4054054  1.586614 60   
# [3] {herbs}      => {root vegetables} 0.007015760 0.4312500  3.956477 69   
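The four measures in this output are tied together by simple ratios. As a rough check, the sketch below takes the printed values for {herbs} => {root vegetables} and recovers the support of each side of the rule. The numbers are copied from the output above; the variable names are illustrative.

```r
# Values copied from the inspect() output for {herbs} => {root vegetables}
n_transactions <- 9835       # total transactions in the data
count_both     <- 69         # baskets containing herbs and root vegetables
confidence     <- 0.4312500
lift           <- 3.956477

# support(lhs and rhs) = count / number of transactions
support_both <- count_both / n_transactions

# confidence = support(lhs and rhs) / support(lhs)
support_lhs <- support_both / confidence

# lift = confidence / support(rhs)
support_rhs <- confidence / lift

round(c(both = support_both, herbs = support_lhs, root_vegetables = support_rhs), 4)
```

Read this way, herbs appear in about 1.6% of baskets and root vegetables in about 11%. A lift close to 4 says that shoppers who buy herbs pick up root vegetables roughly four times as often as shoppers overall.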

inspect(sort(groceryrules, by = "lift")[1:3])
#     lhs                                             rhs                  support     confidence lift     count
# [1] {herbs}                                      => {root vegetables}    0.007015760 0.4312500  3.956477 69   
# [2] {berries}                                    => {whipped/sour cream} 0.009049314 0.2721713  3.796886 89   
# [3] {other vegetables,tropical fruit,whole milk} => {root vegetables}    0.007015760 0.4107143  3.768074 69   

inspect(sort(groceryrules, by = "support")[1:3])
#     lhs                   rhs                support    confidence lift     count
# [1] {other vegetables} => {whole milk}       0.07483477 0.3867578  1.513634 736  
# [2] {whole milk}       => {other vegetables} 0.07483477 0.2928770  1.513634 736  
# [3] {rolls/buns}       => {whole milk}       0.05663447 0.3079049  1.205032 557  

inspect(sort(groceryrules, by = "confidence")[1:3])
#     lhs                            rhs          support     confidence lift     count
# [1] {butter,whipped/sour cream} => {whole milk} 0.006710727 0.6600000  2.583008 66   
# [2] {butter,yogurt}             => {whole milk} 0.009354347 0.6388889  2.500387 92   
# [3] {butter,root vegetables}    => {whole milk} 0.008235892 0.6377953  2.496107 81   

berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)
#     lhs          rhs                  support     confidence lift     count
# [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  3.796886  89
# [2] {berries} => {yogurt}             0.010574479 0.3180428  2.279848 104
# [3] {berries} => {other vegetables}   0.010269446 0.3088685  1.596280 101
# [4] {berries} => {whole milk}         0.011794611 0.3547401  1.388328 116
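The %in% operator used above matches exact item names. arules also provides %pin% for partial matches and %ain% for rules that contain all of the listed items, and these can be combined with conditions on measures such as lift. The sketch below is one way to try them. It assumes that the Groceries data set bundled with arules can stand in for groceries.csv, and the object names are illustrative.

```r
library(arules)

# Assumption: the Groceries data set shipped with arules stands in for groceries.csv
data("Groceries")
rules <- apriori(Groceries,
                 parameter = list(support = 0.006, confidence = 0.25, minlen = 2),
                 control = list(verbose = FALSE))

# Partial match: rules with any item whose name contains "fruit"
fruitrules <- subset(rules, items %pin% "fruit")

# All-match: rules that contain both berries and yogurt
berryyogurt <- subset(rules, items %ain% c("berries", "yogurt"))

# Item conditions can be combined with quality measures such as lift
strongberry <- subset(rules, lhs %in% "berries" & lift > 2)

inspect(berryyogurt)
```

Restricting the match to lhs or rhs instead of items narrows the search to one side of the rule, which helps when the question is what berries lead to rather than what leads to berries.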

 

Coming soon: a detailed explanation of how the market basket algorithm works.

Bibliography: