This is the second post of the imputation series. I covered univariate imputation before; if you somehow missed it, you can find it here.
Let us continue with a more sophisticated method: multivariate imputation. Multivariate imputation algorithms use the entire set of available feature dimensions to estimate the missing values (e.g. sklearn.impute.IterativeImputer).
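Before we get to the dataset, here is a minimal, self-contained sketch of the idea on made-up toy data (the array and values are purely illustrative):

```python
import numpy as np
# IterativeImputer is still experimental, so this enabling import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two correlated features (the second column is roughly twice the first);
# one value in the second column is missing.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])

imp = IterativeImputer(random_state=0)
X_imputed = imp.fit_transform(X)

# Because the imputer models each feature from the others,
# the filled-in value lands near 6, not near the column mean (~4.7).
print(X_imputed[2, 1])
```

A univariate imputer only sees the column itself, so it could do no better than the column mean here; the multivariate imputer exploits the relation between the columns.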
I am going to use the same Heart Failure Prediction Dataset for the entire imputation series.
To demonstrate imputation, I created missing values at a 5% rate in a numeric column, MaxHR. I also treated zeros in the Cholesterol column as missing.
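For reference, missingness of that kind can be injected along these lines (a sketch with made-up data; the variable names are illustrative, not the post's exact code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'MaxHR': rng.integers(60, 200, size=100).astype(float),
    'Cholesterol': rng.integers(0, 400, size=100).astype(float),
})

# Randomly set about 5% of the MaxHR values to NaN.
mask = rng.random(len(df)) < 0.05
df.loc[mask, 'MaxHR'] = np.nan

# Treat zeros in the Cholesterol column as missing values.
df['Cholesterol'] = df['Cholesterol'].replace(0, np.nan)
```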
By the way, do not worry about the implementation details yet. You know that I always provide the source code at the bottom of my posts :-)
After preparing the dataset, I tried several strategies for imputation: 'drop', 'mean', 'median', 'most_frequent', and 'iterative'.
The 'mean', 'median', and 'most_frequent' imputations are univariate strategies. The 'iterative' imputation represents the multivariate one.
strategies = ['drop', 'mean', 'median',
              'most_frequent', 'iterative']
By the way, 'drop' is not a strategy provided by scikit-learn. I made it up to evaluate the effect of dropping rows with missing values instead of imputing them. It will be my benchmark for assessing the effect of imputation on classification performance.
Then, for each strategy, I created an imputed (or row-dropped) version of the dataset and ran a classifier to obtain cross-validation scores.
I picked the Random Forest classifier for this task, for no particular reason.
# Note: IterativeImputer is experimental and must be enabled explicitly:
# from sklearn.experimental import enable_iterative_imputer
# from sklearn.impute import IterativeImputer, SimpleImputer

print("Cross-validation scores for different imputation strategies\n")
print("Strategy          Score  Std.Dev.")
print("-----------------------------------")
results = []
# Keep the original and the corrupted ('missing') versions of the
# column side by side for later comparison with the imputed values.
imputedDF = pd.DataFrame({'original': df[columnsToImpute[1]]})
imputedDF.insert(0, 'missing', dfDummied[columnsToImpute[1]])
# Perform imputation for each strategy and
# apply random forest classification to the imputed data.
for s in strategies:
    # We randomly picked this classifier
    rf = RandomForestClassifier()
    dfTemp = dfDummied.copy()
    if s == 'drop':
        # We will not impute;
        # we will remove rows with missing values instead.
        dfTemp = dfTemp.dropna(axis=0)
    elif s == 'iterative':
        # Multivariate imputation
        imp = IterativeImputer(missing_values=np.NaN,
                               random_state=seed)
        dfTemp = pd.DataFrame(imp.fit_transform(dfTemp),
                              columns=dfTemp.columns)
        imputedDF.insert(0, s, dfTemp[columnsToImpute[1]])
    else:
        # Set the strategy for univariate imputation
        imp = SimpleImputer(missing_values=np.NaN, strategy=s)
        dfTemp[columnsToImpute] = \
            imp.fit_transform(dfTemp[columnsToImpute])
        imputedDF.insert(0, s, dfTemp[columnsToImpute[1]])
    # Create the independent and dependent variable sets
    X, y = dfTemp.values[:, :11], dfTemp.values[:, 11]
    # Perform a 10-fold cross-validation
    scores = cross_val_score(rf, X, y, cv=10)
    # Store the scores in a list
    results.append(scores)
    print('%-15s %7.3f %7.3f' % (s, np.mean(scores),
                                 np.std(scores)))
Strategy Score Std.Dev.
-----------------------------------
drop 0.935 0.035
mean 0.950 0.027
median 0.943 0.025
most_frequent 0.945 0.029
iterative 0.949 0.027
There is no surprise in the drop case: the less data you have, the worse the results you get.
On the other hand, when we compare the RF classification results, we see that multivariate (iterative) imputation did not perform any better than its univariate rivals.
So I cannot say, yet, that multivariate imputation is better than univariate imputation.
We should dig deeper.
Let us see how imputation worked on the dataset. I will compare the original and imputed values for the MaxHR column in the table below (only the top 5 rows).
iterative most_frequent median mean missing original
---------------------------------------------------------------
123.389351 150.0 138.0 136.50 NaN 125
159.560506 150.0 138.0 136.50 NaN 178
130.549229 150.0 138.0 136.50 NaN 122
152.985189 150.0 138.0 136.50 NaN 148
149.873344 150.0 138.0 136.50 NaN 130
The first line reads as follows: the original value of MaxHR was 125. It was randomly set to np.NaN and then imputed. The iterative imputer returned the closest value, i.e., 123.389351.
Let us compare how much the imputed values deviate from the original values. We can use the mean squared error, for instance.
print("Strategy        Mean Squared Error (MSE)")
print("-----------------------------------------")
for c in imputedDF.columns:
    if c not in ('missing', 'original'):
        print('%-15s %15.2f' % (c, np.round(mean_squared_error(
            imputedDF['original'], imputedDF[c]), 2)))
Strategy Mean Squared Error (MSE)
-----------------------------------------
iterative 18.12
most_frequent 30.03
median 28.26
mean 29.04
Hmm, iterative imputation seems to have done a better job here.
Honestly, I still cannot say for sure which imputation approach (univariate or multivariate) works better for this dataset.
When I compared the predictions of a randomly picked RF model, the imputation strategy seemed to have no effect; the results are all close. This may be due to the RF algorithm, which copes well with such imperfections in the data.
However, when I look at the imputation results themselves, the MSE is lowest for multivariate imputation. In other words, multivariate imputation suggests values that are closer to the original ones.
That's all for multivariate imputation. I intend to continue with imputation of categorical variables next time.
Thank you for reading this post. If you have anything to say/object/correct, please drop a comment down below.
The code used in this post is available at GitHub. Feel free to use, distribute, or contribute.