This is the second post of the imputation series. I covered univariate imputation before; if you somehow missed it, you can find it here.
Let us continue with a more sophisticated method: multivariate imputation. Multivariate imputation algorithms use the entire set of available feature dimensions to estimate the missing values (e.g. sklearn.impute.IterativeImputer).
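Before we get to the dataset, here is a minimal, self-contained sketch of the idea on made-up toy data (the array and values are purely illustrative):

```python
import numpy as np
# IterativeImputer is still experimental, so this enabling import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two correlated features (the second column is roughly twice the first);
# one value in the second column is missing.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])

imp = IterativeImputer(random_state=0)
X_imputed = imp.fit_transform(X)

# Because the imputer models each feature from the others,
# the filled-in value lands near 6, not near the column mean (~4.7).
print(X_imputed[2, 1])
```

A univariate imputer only sees the column itself, so it could do no better than the column mean here; the multivariate imputer exploits the relation between the columns.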
I am going to use the same Heart Failure Prediction Dataset for the entire imputation series.
To demonstrate imputation, I created missing values at a 5% rate in a numeric column, MaxHR. I also treated zeros in the Cholesterol column as missing.
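For reference, missingness of that kind can be injected along these lines (a sketch with made-up data; the variable names are illustrative, not the post's exact code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'MaxHR': rng.integers(60, 200, size=100).astype(float),
    'Cholesterol': rng.integers(0, 400, size=100).astype(float),
})

# Randomly set about 5% of the MaxHR values to NaN.
mask = rng.random(len(df)) < 0.05
df.loc[mask, 'MaxHR'] = np.nan

# Treat zeros in the Cholesterol column as missing values.
df['Cholesterol'] = df['Cholesterol'].replace(0, np.nan)
```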
By the way, do not worry about the implementation details yet. You know that I always provide the source code at the bottom of my posts :-)
After preparing the dataset, I tried several strategies for imputation: 'drop', 'mean', 'median', 'most_frequent', and 'iterative'.
The 'mean', 'median', and 'most_frequent' imputations are univariate strategies. The 'iterative' imputation represents the multivariate one.
strategies = ['drop', 'mean', 'median',
              'most_frequent', 'iterative']
By the way, 'drop' is not a strategy provided by scikit-learn. I made it up to evaluate the effect of dropping rows with missing values instead of imputing them. It will be my benchmark for assessing the effect of imputation on classification performance.
Then, for each strategy, I created an imputed (or row-dropped) version of the dataset and ran a classifier to obtain cross-validation scores.
I picked the Random Forest classifier for this task, for no particular reason.
# Note: IterativeImputer is experimental and must be enabled explicitly:
# from sklearn.experimental import enable_iterative_imputer
# from sklearn.impute import IterativeImputer, SimpleImputer

print("Cross-validation scores for different imputation strategies\n")
print("Strategy          Score  Std.Dev.")
print("-----------------------------------")
results = []
# Keep the original and the corrupted ('missing') versions of the
# column side by side for later comparison with the imputed values.
imputedDF = pd.DataFrame({'original': df[columnsToImpute[1]]})
imputedDF.insert(0, 'missing', dfDummied[columnsToImpute[1]])
# Perform imputation for each strategy and
# apply random forest classification to the imputed data.
for s in strategies:
    # We randomly picked this classifier
    rf = RandomForestClassifier()
    dfTemp = dfDummied.copy()
    if s == 'drop':
        # We will not impute;
        # we will remove rows with missing values instead.
        dfTemp = dfTemp.dropna(axis=0)
    elif s == 'iterative':
        # Multivariate imputation
        imp = IterativeImputer(missing_values=np.NaN,
                               random_state=seed)
        dfTemp = pd.DataFrame(imp.fit_transform(dfTemp),
                              columns=dfTemp.columns)
        imputedDF.insert(0, s, dfTemp[columnsToImpute[1]])
    else:
        # Set the strategy for univariate imputation
        imp = SimpleImputer(missing_values=np.NaN, strategy=s)
        dfTemp[columnsToImpute] = \
            imp.fit_transform(dfTemp[columnsToImpute])
        imputedDF.insert(0, s, dfTemp[columnsToImpute[1]])
    # Create the independent and dependent variable sets
    X, y = dfTemp.values[:, :11], dfTemp.values[:, 11]
    # Perform a 10-fold cross-validation
    scores = cross_val_score(rf, X, y, cv=10)
    # Store the scores in a list
    results.append(scores)
    print('%-15s %7.3f %7.3f' % (s, np.mean(scores),
                                 np.std(scores)))
Strategy Score Std.Dev.
-----------------------------------
drop 0.935 0.035
mean 0.950 0.027
median 0.943 0.025
most_frequent 0.945 0.029
iterative 0.949 0.027
There is no surprise in the drop case: the less data you have, the worse the results you get.
On the other hand, when we compare the RF classification results, we see that multivariate (iterative) imputation did not perform any better than its univariate rivals.
So I cannot say, yet, that multivariate imputation is better than univariate imputation.
We should dig deeper.
Let us see how imputation worked on the dataset. I will compare the original and imputed values for the MaxHR column in the table below (only the top 5 rows).
iterative most_frequent median mean missing original
---------------------------------------------------------------
123.389351 150.0 138.0 136.50 NaN 125
159.560506 150.0 138.0 136.50 NaN 178
130.549229 150.0 138.0 136.50 NaN 122
152.985189 150.0 138.0 136.50 NaN 148
149.873344 150.0 138.0 136.50 NaN 130
The first line reads as follows: the original value of MaxHR was 125. It was randomly set to np.NaN and then imputed. The iterative imputer returned the closest value, i.e., 123.389351.
Let us compare how much the imputed values deviate from the original values. We can use the mean squared error, for instance.
print("Strategy        Mean Squared Error (MSE)")
print("-----------------------------------------")
for c in imputedDF.columns:
    if c not in ('missing', 'original'):
        print('%-15s %15.2f' % (c, np.round(mean_squared_error(
            imputedDF['original'], imputedDF[c]), 2)))
Strategy Mean Squared Error (MSE)
-----------------------------------------
iterative 18.12
most_frequent 30.03
median 28.26
mean 29.04
Hmm, iterative imputation seems to have done a better job here.
Honestly, I still cannot say for sure which imputation approach (univariate or multivariate) works better for this dataset.
When I compared the predictions of a randomly picked RF model, the imputation strategy seemed to have no effect; the results are all close. This may be due to the RF algorithm, which copes well with such imperfections in the data.
However, when I look at the imputation results themselves, the MSE is lowest for multivariate imputation. In other words, multivariate imputation suggests values that are closer to the original ones.
That's all for multivariate imputation. I intend to continue with imputation of categorical variables next time.
Thank you for reading this post. If you have anything to say/object/correct, please drop a comment down below.
The code used in this post is available at GitHub. Feel free to use, distribute, or contribute.