Estimation of a predictor's importance by Random Forests when there is missing data: risk prediction in liver surgery using laboratory data

Int J Biostat. 2014;10(2):165-83. doi: 10.1515/ijb-2013-0038.

Abstract

In the last few decades, new developments in liver surgery have led to expanded applicability and improved safety. However, liver surgery is still associated with postoperative morbidity and mortality, especially in extended resections. We analyzed a large liver surgery database to investigate whether laboratory parameters like haemoglobin, leucocytes, bilirubin, haematocrit and lactate might be relevant preoperative predictors. It is not uncommon to observe missing values in such data. This also holds for many other data sources and research fields. For analysis, one can make use of imputation methods or of approaches that are able to deal with missing values in the predictor variables. Random Forests are a representative of the latter; they also provide variable importance measures to assess a variable's relevance for prediction. Applied to the liver surgery data, we observed divergent results for the laboratory parameters, depending on the method used to cope with missing values. We therefore performed an extensive simulation study to investigate the properties of each approach. Findings and recommendations: Complete case analysis should not be used as it distorts the relevance of completely observed variables in an undesirable way. The estimation of a variable's importance by a self-contained measure that can deal with missing values appropriately reflects the decreased relevance of variables with missing values. It can therefore be used to obtain insight into Random Forests, which are commonly fit without preprocessing of missing values in the data. By contrast, multiple imputation allows for the assessment of a variable's relevance one would potentially observe in complete-data situations, if imputation performs well. For the laboratory data, lactate and bilirubin seem to be associated with the risk of liver failure and postoperative complications. These relations should be investigated by future studies in more detail. However, it is important to carefully consider the method used for analysis when there are missing values in the predictor variables.
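
The contrast between complete case analysis and imputation can be illustrated with a minimal sketch. It is not the authors' simulation study or their Random Forest implementation. The functions `simulate`, `complete_cases` and `mean_impute` are hypothetical names introduced here, the data are synthetic, and single mean imputation is used only as a simplified stand-in for multiple imputation. The sketch shows one thing: complete case analysis discards every row with a missing value, so it also discards the observed values of the completely observed predictor in those rows.

```python
import random


def simulate(n=1000, missing_rate=0.4, seed=0):
    """Synthetic data (assumption): x1 is always observed, x2 is missing at random."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = rng.gauss(0, 1)
        if rng.random() < missing_rate:
            x2 = None  # mark the laboratory value as missing
        rows.append([x1, x2])
    return rows


def complete_cases(rows):
    """Complete case analysis: keep only rows without any missing predictor."""
    return [r for r in rows if all(v is not None for v in r)]


def mean_impute(rows):
    """Single mean imputation (simplified stand-in for multiple imputation)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


rows = simulate()
cc_rows = complete_cases(rows)
imputed_rows = mean_impute(rows)

# Observed x1 values available to each approach
print("x1 values in full data:      ", len(rows))
print("x1 values after complete case:", len(cc_rows))
print("x1 values after imputation:   ", len(imputed_rows))
```

In this sketch the imputed data keep every observed value of `x1`, whereas the complete case data lose those from all rows in which `x2` is missing. Whether and how this translates into distorted importance estimates is what the simulation study described above investigates.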

MeSH terms

  • Algorithms*
  • Biomarkers
  • Computer Simulation
  • Data Interpretation, Statistical*
  • Databases, Factual
  • Humans
  • Liver / surgery*
  • Models, Statistical*
  • Postoperative Complications / blood
  • Postoperative Complications / epidemiology*
  • Prognosis
  • Regression Analysis

Substances

  • Biomarkers