Blood Donation on DrivenData - Part I Exploration

DrivenData.org is a machine learning competition website similar to the better-known Kaggle.com, but with a different angle: it focuses on applying data science to social issues. And it’s based in Boston!

For the learning data scientist, DrivenData offers a good alternative to Kaggle. With less exposure comes less competition, which means that you can actually reach decent positions on the leaderboard without committing to a monk’s life, even though some of the best Kagglers can also be found on DrivenData leaderboards.

The Blood Donation Competition

The code is here

The entry-level competition is a supervised binary classification problem on blood donation. Given the past donation behavior of 748 donors, the goal is to predict whether or not each of them will donate blood in a given future month.

The data comes from the UCI Machine Learning repository. The Blood Transfusion Service Center Data Set was obtained from the Blood Transfusion Service Center in Hsin-Chu City in Taiwan. That data set has been used in many research publications.

The set of features is simple and is commonly known as the RFM model. It has been used extensively in marketing to model the behavior of potential customers.

Knowing i) the Recency [R]: the time since the last blood donation (or purchase), ii) the Frequency [F]: the total number of donations (or purchases), and iii) the Monetary [M]: the amount of blood donated (or the amount of money spent), the challenge is to predict a binary variable representing whether or not the person will donate blood (or make a purchase).

In our context, a 4th feature has been added: the Time [T], the number of months since the first donation. We are dealing with an RFMT model, sometimes also called an LRFM model with L as length.
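To make the four definitions concrete, here is a minimal sketch of how RFMT features could be derived from a raw donation log. The log itself, its column names (donor, month, volume) and the reference month are all hypothetical; the competition data already comes aggregated per donor.

```r
# Hypothetical donation log: one row per donation (NOT the competition data).
donations <- data.frame(
  donor  = c("a", "a", "a", "b"),
  month  = c(1, 5, 10, 7),   # month index of each donation
  volume = 250               # c.c. given per donation
)
ref_month <- 12              # assumed month at which the features are computed

# One row of RFMT features per donor.
rfmt <- do.call(rbind, lapply(split(donations, donations$donor), function(d) {
  data.frame(
    donor     = d$donor[1],
    Recency   = ref_month - max(d$month),  # months since last donation
    Frequency = nrow(d),                   # total number of donations
    Monetary  = sum(d$volume),             # total volume donated
    Time      = ref_month - min(d$month)   # months since first donation
  )
}))
print(rfmt)
```

Donor "b" has given only once, so Recency and Time coincide for that donor.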

The DrivenData Set

The DrivenData set has been split into a training and a test set. The training set is composed of 576 rows and the testing set of 200 rows. The features are titled (as in the original UCI dataset):

  • User_ID
  • Months since Last Donation
  • Number of Donations
  • Total Volume Donated (c.c.)
  • Months since First Donation
  • Made Donation in March 2007

Since the volume of blood donated is the same for every person and every donation (250 c.c.), the feature “Total Volume Donated (c.c.)” (Monetary) is exactly proportional to “Number of Donations” (Frequency): Monetary = 250 * Frequency.
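The proportionality is easy to verify. The sketch below uses only the first ten training rows, copied from the str(train) output shown later in this post; on the full data the same check would be run on the corresponding columns.

```r
# First ten rows of the training set, as printed by str(train).
frequency <- c(50, 13, 16, 20, 24, 4, 7, 12, 46, 3)
monetary  <- c(12500, 3250, 4000, 5000, 6000, 1000, 1750, 3000, 11500, 750)

# TRUE if every donor gave exactly 250 c.c. per donation.
is_proportional <- all(monetary == 250 * frequency)
print(is_proportional)
```

If the check holds on the whole data set, one of the two features carries no extra information.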

The evaluation metric used is log loss. The predictions are the probability that a donor made another donation in March 2007.
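For reference, binary log loss can be sketched in a few lines of R. The function name log_loss and the clipping constant eps are my own choices to avoid log(0); the exact clipping used by DrivenData's scoring is not something I have verified.

```r
# Binary log loss: y is the 0/1 outcome, p the predicted probability of 1.
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # keep probabilities away from 0 and 1
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Predicting 0.5 for everyone gives log(2), whatever the outcomes are.
baseline <- log_loss(c(1, 0, 0, 1), rep(0.5, 4))
print(baseline)
```

Confident but wrong predictions are penalized heavily, so it usually pays to keep probabilities away from 0 and 1.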

Data exploration in R

Loading the data

library('readr')
setwd("your current working directory")
train <- read_csv('data/ddblood_train.csv')
test  <- read_csv('data/ddblood_test.csv')

A first look at the train set:

> str(train)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   576 obs. of  6 variables:
$ User_ID                    : int  619 664 441 160 358 335 47 164 736 436 ...
$ Months since Last Donation : int  2 0 1 2 1 4 2 1 5 0 ...
$ Number of Donations        : int  50 13 16 20 24 4 7 12 46 3 ...
$ Total Volume Donated (c.c.): int  12500 3250 4000 5000 6000 1000 1750 3000 11500 750 ...
$ Months since First Donation: int  98 28 35 45 77 4 14 35 98 4 ...
$ Made Donation in March 2007: int  1 1 1 1 0 0 1 0 1 0 ...

I will rename the features first to avoid dealing with long variable names containing whitespace, and to stay closer to the RFMT model.

colnames(train) <- c("Id","Recency", "Frequency", "Monetary",  "Time", "C")
colnames(test)  <- c("Id","Recency", "Frequency", "Monetary", "Time")

Finally, I remove the target (y) from the train set and keep only the relevant features. Monetary is dropped since it is exactly proportional to Frequency.

y     <- train$C
train <- train[, c("Recency", "Frequency", "Time"), drop = FALSE]
test  <- test[, c("Recency", "Frequency", "Time"), drop = FALSE]

Train / Test comparison

Figure 1 below shows boxplots of the features for the train and test sets. We can see that the train and test sets have very similar distributions. Some outliers (Recency > 30 or Frequency > 20) are clearly visible.

Fig1 - Comparing Train and Test sets
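A figure of this kind can be produced with base R boxplots. The sketch below is an illustration only: it takes the ten training rows printed by str(train) and splits them arbitrarily into two groups that stand in for the train and test sets, so the resulting plot says nothing about the real data.

```r
# Stand-ins for the two sets: the ten rows shown by str(train), split in two.
set_a <- data.frame(Recency = c(2, 0, 1, 2, 1),
                    Frequency = c(50, 13, 16, 20, 24),
                    Time = c(98, 28, 35, 45, 77))
set_b <- data.frame(Recency = c(4, 2, 1, 5, 0),
                    Frequency = c(4, 7, 12, 46, 3),
                    Time = c(4, 14, 35, 98, 4))

# Stack both sets and tag each row with its origin.
stacked <- rbind(cbind(set_a, Set = "train"), cbind(set_b, Set = "test"))

# One panel per feature, one box per set.
op <- par(mfrow = c(1, 3))
for (feature in c("Recency", "Frequency", "Time")) {
  boxplot(stacked[[feature]] ~ stacked$Set, main = feature,
          xlab = "", ylab = feature)
}
par(op)
```

With the real data, set_a and set_b would simply be the train and test data frames after the column selection above.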

Let’s check the Recency outliers for the train set.

> index <- train$Recency > 30
> train[index, ]
    Recency Frequency Time
385      35         3   64
386      74         1   74
575      39         1   39
576      72         1   72
> y[index]
[1] 0 0 0 0

Basically, these 4 people have not given blood in the past 35, 39, 72 and 74 months. Three of them have donated only once (Frequency = 1, and therefore Time = Recency), and none of them donated in March. This is coherent: these outliers appear to be genuine observations rather than anomalies caused by human error.

Similarly, let’s look at the Frequency outliers, people who have donated more than 30 times.

> index <- train$Frequency > 30
> train[index, ]
    Recency Frequency Time
1         2        50   98
9         5        46   98
264      23        38   98
387       2        43   86
389       2        44   98
398       4        33   98
> y[index]
[1] 1 1 0 1 0 1

This subset is composed of frequent donors. Of the 2 who did not donate in March:

  • One has not donated in the past 23 months, although they donated 38 times before that. They have probably stopped donating altogether.
  • The other gave blood 2 months ago and has donated 44 times in total over the past 98 months.

This is coherent, and these outliers are probably legitimate as well. So we’ll keep the outliers in the train set.

March donors vs non-donors

Let’s now compare the feature distribution for the people who have donated in March and those who have not.

Fig2 - Comparing donors vs non-donors

Fig 2 shows that

  • People who have donated more recently (low Recency) are more prone to donate in March. This is a known marketing result: the older the last purchase, the lower the chance that the person will buy again.
  • People who have donated often (high Frequency) are more prone to donate again.
  • Strikingly, the distributions of the Time feature are similar for both groups, which suggests that the time since the first donation is less of a factor in future donations. This is also a known phenomenon in customer behavior: it does not matter when you first made a purchase; what matters is how recent your last purchase is.
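Besides plotting, the two groups can be compared numerically, for instance with per-group medians. This is a sketch on the ten training rows printed by str(train) only; with so few rows the numbers are not meant to reproduce the pattern of Fig 2.

```r
# The ten rows shown by str(train), with the target C (donated in March 2007).
d <- data.frame(
  Recency   = c(2, 0, 1, 2, 1, 4, 2, 1, 5, 0),
  Frequency = c(50, 13, 16, 20, 24, 4, 7, 12, 46, 3),
  Time      = c(98, 28, 35, 45, 77, 4, 14, 35, 98, 4),
  C         = c(1, 1, 1, 1, 0, 0, 1, 0, 1, 0)
)

# Median of each feature for non-donors (C = 0) and donors (C = 1).
medians <- aggregate(cbind(Recency, Frequency, Time) ~ C, data = d, FUN = median)
print(medians)
```

On the full training set, the same call would be run on the features combined with y.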

Correlation

Fig 3 shows the correlation matrix between the 3 features for the train set.

Fig3 - Correlation matrix for the train set

The correlation between Frequency and Time was to be expected: the more often people have donated, the earlier they tend to have started donating.
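A correlation matrix like this one is a single call to cor(). The sketch below runs it on the ten training rows printed by str(train), so the coefficients will differ from those of the full train set.

```r
# The ten rows shown by str(train), features only.
features <- data.frame(
  Recency   = c(2, 0, 1, 2, 1, 4, 2, 1, 5, 0),
  Frequency = c(50, 13, 16, 20, 24, 4, 7, 12, 46, 3),
  Time      = c(98, 28, 35, 45, 77, 4, 14, 35, 98, 4)
)

# Pearson correlation between every pair of features.
corr <- cor(features)
print(round(corr, 2))
```

Applied to the real data, features would be replaced by the train data frame.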

Some questions

  • Every 3 months?

The data description on the UCI repository states that: “The center passes their blood transfusion service bus to one university in Hsin-Chu City to gather blood donated about every three months.”

This is at odds with the fact that many of the Recency values are not multiples of 3:

unique(sort(train$Recency))
    [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 20 21 22 23 25 26 35 39 72 74
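To put a number on it, the distinct Recency values listed above can be tested for divisibility by 3. The vector below is copied from that output.

```r
# Distinct Recency values of the train set, as listed above.
recency_values <- c(0:18, 20:23, 25, 26, 35, 39, 72, 74)

# TRUE where the value is compatible with a strict 3-month cycle.
is_multiple_of_3 <- recency_values %% 3 == 0
print(table(is_multiple_of_3))
```

19 of the 29 distinct values are not multiples of 3.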

  • Multiple donations in the same month?

Some people have given several times in the same month, although there is a compulsory minimum time between donations. (… for the Red Cross for instance)

> index <- (train$Recency == train$Time) & train$Frequency > 1
> dim(train[index,])
[1] 26  3

26 people have donated more than once in the same month, and 3 of them have even donated more than twice:

> index <- (train$Recency == train$Time) & train$Frequency > 2
> train[index,]
    Recency Frequency Time
6         4         4    4
392       4         4    4
430      14         5   14

No such cases can be found in the test set. It has been suggested that these were possibly platelet donations, which follow less strict frequency rules than blood donations.

Next: Feature Engineering and Feature Importance for Gini and Accuracy metrics in Random Forests

