I have been struggling with imputation recently. Therefore, I decided to create a series on data imputation.
I am going to use the Heart Failure Prediction Dataset for demonstrations. Unfortunately, the dataset does not contain any missing data. So, I had to introduce some :-)
As usual, there is a link to the source code at the bottom of this page. You can find the implementation details there.
The dataset consists of categorical and numeric variables. For the sake of simplicity, I picked a numeric variable (i.e., Cholesterol) and randomly removed 5% of its values to demonstrate how simple imputation works.
df.loc[df.sample(frac=0.05).index, columnToImpute] = np.nan
I did the imputation, did the analysis, and wrote this post (which I am now revising). However, I later discovered that the dataset is not really complete: there are far too many zeros in the Cholesterol column. So, I decided to treat those zeros as missing values and impute them directly.
df['Cholesterol'].value_counts().head()
0      172
254     11
223     10
220     10
211      9
Name: Cholesterol, dtype: int64
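A minimal sketch of how the share of zeros in a column can be measured, and how the zeros can be recoded as NaN if you prefer pandas to treat them as missing. The toy frame and its values are made up for illustration; only the column name comes from the dataset.

```python
import pandas as pd

# Toy stand-in for the Cholesterol column; the values are made up.
toy = pd.DataFrame({"Cholesterol": [0, 254, 223, 0, 220, 211, 0, 254]})

# Share of rows where the column is exactly zero.
zero_rate = (toy["Cholesterol"] == 0).mean()
print("zero rate: %.3f" % zero_rate)

# Recode zeros as NaN so that pandas treats them as missing.
# mask() replaces the entries where the condition holds with NaN.
toy["Cholesterol"] = toy["Cholesterol"].mask(toy["Cholesterol"] == 0)
n_missing = int(toy["Cholesterol"].isna().sum())
print("missing values:", n_missing)
```

Recoding is optional: SimpleImputer can also be told directly which value marks a missing entry through its missing_values parameter.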
I tried several strategies for imputation, i.e., 'drop', 'mean', 'median', and 'most_frequent'.
Actually, drop is not a strategy provided by scikit-learn. I made it up to evaluate the effect of dropping rows with missing values instead of imputing them. It will be my benchmark for evaluating imputation.
By the way, there is another strategy called "constant" that I do not consider here, because I do not have the domain expertise to say what that constant's value should be.
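To make the strategies concrete, here is a small sketch of what each one fills in for a made-up column with two missing entries. The numbers are invented for illustration, and the fill_value used for "constant" is an arbitrary placeholder, not a recommendation.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column with two missing entries (NaN).
col = np.array([[200.0], [np.nan], [220.0], [220.0], [np.nan], [260.0]])

fills = {}
for strategy in ["mean", "median", "most_frequent"]:
    imp = SimpleImputer(strategy=strategy)
    imp.fit(col)
    # statistics_ holds the value that will replace missing entries.
    fills[strategy] = float(imp.statistics_[0])
    print("%-15s %7.1f" % (strategy, fills[strategy]))

# "constant" needs a fill_value picked by someone who knows the domain.
# 200.0 is an arbitrary placeholder here.
const_imp = SimpleImputer(strategy="constant", fill_value=200.0)
const_filled = const_imp.fit_transform(col)
```

All three data-driven strategies replace every missing entry in the column with the same single value, which is why they are called univariate.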
strategies = ['drop', 'mean', 'median', 'most_frequent']
Then, for each strategy, I created an imputed (or dropped) version of the dataset and ran a classifier to get cross-validation scores.
I picked the Random Forest classifier for this task, for no specific reason.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

print("Cross-validation scores for different imputation strategies\n")
print("Strategy          Score Std.Dev.")
print("-----------------------------------")
results = []
# Perform imputation for each strategy and
# apply random forest classification to the imputed data.
for s in strategies:
    # The classifier was picked for no specific reason
    rf = RandomForestClassifier()
    dfTemp = dfDummied.copy()
    if s == 'drop':
        # Do not impute; remove the rows with missing values instead.
        # Missing values are encoded as 0 in this column.
        dfTemp = dfTemp[dfTemp[columnToImpute] != 0]
    else:
        # Impute the zeros with the chosen strategy
        imp = SimpleImputer(missing_values=0, strategy=s)
        dfTemp[columnToImpute] = imp.fit_transform(dfTemp[[columnToImpute]]).ravel()
    # Create the independent (X) and dependent (y) variable sets
    X, y = dfTemp.values[:, :11], dfTemp.values[:, 11]
    # Perform a 10-fold cross-validation
    scores = cross_val_score(rf, X, y, cv=10)
    # Store the scores in a list
    results.append(scores)
    print('%-15s %7.3f %7.3f' % (s, np.mean(scores), np.std(scores)))
Cross-validation scores for different imputation strategies
Strategy          Score Std.Dev.
-----------------------------------
drop              0.941   0.030
mean              0.945   0.027
median            0.943   0.028
most_frequent     0.944   0.024
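One caveat about the loop above: the imputer is fitted on the whole column before cross-validation, so the rows of each test fold also contribute to the imputed value. A possible variation, sketched below, puts the imputer and the classifier into one pipeline so that the imputer is refit on the training part of every fold. This is only a sketch on synthetic data from make_classification, not the heart failure dataset, and the 20% missing rate, tree count, and seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data; not the heart failure dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Blank out roughly 20% of the first feature at random.
rng = np.random.default_rng(0)
missing = rng.random(X.shape[0]) < 0.2
X[missing, 0] = np.nan

# cross_val_score clones and fits the whole pipeline per fold,
# so the imputer only sees the training rows of that fold.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
scores = cross_val_score(model, X, y, cv=10)
print('%-15s %7.3f %7.3f' % ("mean", np.mean(scores), np.std(scores)))
```

With a single column and a simple statistic such as the mean, the difference is probably small, but the pipeline version keeps the comparison between strategies cleaner.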
Honestly, I cannot say for sure that imputation worked well for this dataset and the selected column when the missing value rate is relatively low (about 18% in the experiments above): the four scores are all within one standard deviation of each other. There may be cases where it works, though.
Then, I did everything again by using a higher missing value rate for the Cholesterol column, i.e., 25%.
Strategy          Score Std.Dev.
-----------------------------------
drop              0.939   0.039
mean              0.945   0.027
median            0.945   0.027
most_frequent     0.948   0.026
However, I guess I can say that imputation is likely to work better than dropping rows as the missing value rate gets higher. I do not know for sure; there are just too many possibilities to try. Depending on the dataset's characteristics, the mean and the most_frequent strategies seem more likely to work.
On the other hand, in most cases, univariate imputation is a no-go for me. After all, if a patient's cholesterol was not measured, it makes no sense to put a made-up number there instead.
Still, I repeated the experiment a few times, and the cross-validation score for the drop strategy was always the lowest, so univariate imputation seems more likely to pay off when the missing value rate is higher.
Of course, if the missing value rate is too high for a column, it may be better to consider dropping that column instead of imputing it.
As I probably said before, dropping or imputing data is a tedious task. One must carefully investigate the data and consider the application scenario for trade-offs.
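As a rough sketch of that idea, the snippet below computes the missing value rate per column and drops the columns that exceed a threshold. The toy values are made up, and the 0.5 threshold (max_missing_rate) is a hypothetical choice; what counts as "too high" depends on the data and the application.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; missing entries are NaN.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "Cholesterol": [289.0, np.nan, np.nan, np.nan, 195.0],
    "MaxHR": [172, 156, 98, np.nan, 122],
})

max_missing_rate = 0.5  # hypothetical threshold, tune for your data

# isna().mean() gives the fraction of missing values per column.
missing_rate = toy.isna().mean()
print(missing_rate)

# Drop the columns whose missing rate exceeds the threshold.
to_drop = missing_rate[missing_rate > max_missing_rate].index
reduced = toy.drop(columns=to_drop)
print(list(reduced.columns))
```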
Otherwise, you waste your time on a blog post, later discover your stupid mistake, and waste more time correcting and publishing your stupidity.
Anyhow. I find this topic interesting and will keep posting on this.
Thank you for reading this post. If you have anything to say/object/correct, please drop a comment down below.
The code used in this post is available at GitHub. Feel free to use, distribute, or contribute.