Objective: Cardiovascular disease (CVD) is one of the leading causes of death worldwide, and many questions urgently need answering, especially in risk identification and prognosis prediction. Real-world studies (RWS), with their large numbers of observations, are an important data source for CVD research, but they are constrained by high dimensionality, missing data, and unstructured data. Machine learning (ML) methods, which include a variety of supervised and unsupervised algorithms, are useful for data governance and effective for analyzing and imputing high-dimensional data in real-world studies. This study reviewed the theory, strengths, limitations, and applications of several popular ML methods in the CVD field as a reference for further application.
Methods: This study introduced the origin, purpose, theory, strengths, limitations, and applications of several popular ML algorithms, including hierarchical and k-means clustering, principal component analysis, random forest, support vector machine, and neural networks. A worked example applying random forest to data from the Systolic Blood Pressure Intervention Trial (SPRINT) was used to demonstrate the process and main results of applying ML in CVD research.
Conclusion: ML methods are effective tools for producing real-world evidence to support clinical decisions and meet clinical needs. This review explains the principles of multiple ML methods in accessible language and may serve as a reference for further application. Future research is warranted on more accurate ensemble learning methods and on their wider application in the medical field.