Nonalcoholic steatohepatitis (NASH) and cirrhosis are among the most dire consequences of obesity, yet they can be diagnosed only with a liver biopsy. With the development of machine learning (ML) algorithms, clinical data can now be used in much more sophisticated ways. The aim of this project was to determine whether ML could identify nonalcoholic fatty liver disease (NAFLD), NASH, or advanced fibrosis using commonly available clinical and laboratory data.


This study used a research dataset derived from the UF NAFLD/NASH program. We collected thirty-three clinical and laboratory variables from a total of 488 patients with liver biopsy data. The biopsy outcomes were NAFLD, NASH, and fibrosis. To represent the clinical and laboratory variables for ML, we compared one-hot encoding (OHE), in which continuous laboratory values were converted into clinically meaningful categories, and MIX encoding, in which both the OHE categories and the continuous values were used. We compared three ML algorithms, logistic regression (LR), decision tree (DT), and random forests (RFs), for the prediction of NAFLD, NASH, and fibrosis. We used 5-fold cross-validation and reported the area under the receiver operating characteristic curve (AUC or AUC-ROC) as the evaluation metric.
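
The difference between the two encodings can be illustrated with a minimal Python sketch. The cut points, the category labels, and the function names `encode_ohe` and `encode_mix` are hypothetical and chosen only for illustration; they are not the clinical categories or code used in the study.

```python
from bisect import bisect_right

# Hypothetical cut points for a single lab value (illustrative only,
# not the study's categories). Two cuts give three bins:
# value < 40, 40 <= value < 80, value >= 80.
ALT_CUTS = [40.0, 80.0]
ALT_LABELS = ["alt_low", "alt_mid", "alt_high"]


def encode_ohe(value, cuts):
    """One-hot encode a continuous value by the bin it falls into."""
    bin_index = bisect_right(cuts, value)
    return [1 if i == bin_index else 0 for i in range(len(cuts) + 1)]


def encode_mix(value, cuts):
    """MIX-style encoding: the one-hot bins plus the raw continuous value."""
    return encode_ohe(value, cuts) + [float(value)]


if __name__ == "__main__":
    for alt in (35.0, 40.0, 95.0):
        print(alt, encode_ohe(alt, ALT_CUTS), encode_mix(alt, ALT_CUTS))
```

Under OHE a model sees only the category, whereas MIX also keeps the raw value, so distinctions within a category remain available to the learner.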


Among the 488 patients, 342 had NAFLD, 198 had NASH, and 60 had advanced fibrosis. Among the three ML algorithms, RFs with MIX encoding achieved the best AUC-ROC: 0.909 for NAFLD, 0.815 for NASH, and 0.864 for advanced fibrosis.
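
To help interpret these scores, the AUC-ROC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it by direct pair counting on made-up labels and scores (not study data); the function name `auc_roc` is an assumption for illustration, and this is not the implementation used in the study.

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up example: 3 positives, 3 negatives, so 9 pairs in total.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

if __name__ == "__main__":
    print(round(auc_roc(labels, scores), 3))  # 8 of 9 pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation of positives from negatives.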


This study shows that it is feasible to use ML algorithms to identify NAFLD, NASH, and fibrosis from commonly available clinical data. Further validation using larger datasets is required. If validated, this approach could obviate the need for a liver biopsy to diagnose NASH and improve treatment at this reversible stage.