Abstract
This study implements two supervised machine learning models,
Decision Tree and Multilayer Perceptron
(MLP), to predict heart attack likelihood using a
labeled dataset of 1,888 rows and 14 features. Leveraging
significant features identified in prior research, our optimized
models each achieved an accuracy and F1-score of 92.33% and were
also evaluated with precision, recall, and specificity.
Compared with most of the similar studies we reviewed, the models
performed better, which we attribute to the larger dataset and
hyperparameter tuning. This research
demonstrates the potential of machine learning for early heart
disease diagnosis, aiming for future real-time clinical
applications.
Introduction
Cardiovascular diseases (CVDs) are a major global health concern,
responsible for a substantial proportion of worldwide mortality.
According to the World Health Organization (2021), approximately
17.9 million people died from CVDs in 2019, representing 32% of all
deaths globally. Of these, 85% were due to heart attack and
stroke. While major progress has been made in medical diagnostics,
early and accurate prediction of heart attack risk remains critical
in reducing mortality rates.
In recent years, machine learning has emerged as a promising tool for
predictive healthcare analytics, offering the potential to enhance
early diagnosis by identifying patterns in complex datasets that may
not be apparent to traditional medical analyses. In this project, we
employed two machine learning models, a Decision Tree and a
Multilayer Perceptron (MLP) neural network, trained on a large
dataset to predict the likelihood of heart attacks. We specifically
chose features shown to be significant in heart attack prediction,
based on a review of more than five research papers.
The dataset used in this study was generated by combining five
publicly available datasets, creating a comprehensive dataset of
1,888 rows and 14 attributes after dropping missing data.
Hyperparameter tuning and performance evaluation were conducted for
both models. Additionally, we calculated feature importance to
understand which factors played a critical role in the model’s
predictions. However, while feature importance was analyzed, it was
not directly used in model tuning.
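As a rough illustration of how feature importance can be read from a fitted tree, the sketch below uses scikit-learn's impurity-based `feature_importances_`. This is an assumption: the write-up does not say which importance method was used, and the data here is synthetic, with the study's column names attached only as labels.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# The 13 predictor names used in the study (the target column is excluded).
FEATURES = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]

# Synthetic stand-in data; these values are NOT the study's dataset.
X, y = make_classification(
    n_samples=500, n_features=len(FEATURES), n_informative=6, random_state=0
)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = pd.Series(tree.feature_importances_, index=FEATURES)
importances = importances.sort_values(ascending=False)
print(importances.head())
```

Because the data is synthetic, the ranking printed here says nothing about which clinical features actually matter; only the mechanics carry over.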
The following sections will detail the methodology, hyperparameter
tuning processes, and the resulting performance of both models. We
will also compare our findings with existing research to highlight
the advancements made in heart attack prediction using machine
learning.
The dataset we compiled is available on Kaggle, and the code for
the model implementation is available in our Kaggle notebook.
Additionally, the full source code is available on GitHub.
You can also check out the video presentation of the system down
below.
Research Gap
Although several studies on heart attack prediction using machine
learning have been conducted, several gaps remain:
- Limited Dataset Size: Many studies have relied
on small datasets, limiting the generalizability of their
models. By merging five public datasets, we sought to address
this gap and provide a more robust dataset to train our models
(Alshraideh, et al., 2024).
- Exclusion of Critical Features: Age, sex, cp,
restecg, thalach, exang, oldpeak, slope, ca, and thal have been
identified as the most relevant attributes for predicting heart
disease (Chellammal & Sharmila, 2019). However, some research
models exclude some of these critical features (Hossain, et al.,
2023). Our dataset incorporates all of them.
The findings from this study aim to fill these gaps by using a
larger, more diverse dataset and by incorporating critical health
features that are often overlooked in prior research.
Data Collection & Pre-processing
We compiled a comprehensive dataset by merging five public heart
disease datasets, four from Kaggle and one from Figshare. This
larger dataset provides a richer set of patient data for training
and testing the machine learning models. The
datasets used are detailed in the table below.
Dataset Details
Key Features in the Dataset:
- age: The age of the patient
- sex: Gender (1 = male, 0 = female)
- cp: Chest pain type (four categories)
- trestbps: Resting blood pressure (in mm Hg)
- chol: Serum cholesterol in mg/dl
- fbs: Fasting blood sugar (1 = >120 mg/dl, 0 = otherwise)
- restecg: Resting electrocardiographic results (three categories)
- thalach: Maximum heart rate achieved
- exang: Exercise-induced angina (1 = yes, 0 = no)
- oldpeak: ST depression induced by exercise
- slope: Slope of the peak exercise ST segment
- ca: Number of major vessels colored by fluoroscopy
- thal: Thalassemia (four categories)
- target: Risk of heart attack (1 = high, 0 = low)
Preprocessing
Data Cleaning
The initial combined dataset contained 2,181 rows and fourteen
columns. Upon inspection, 293 rows were found to contain missing
data across key features. The most significant missing data was in
the ca (291 missing values), thal (266 missing values), and slope
(190 missing values) columns. Rather than imputing the missing
values, we chose to delete these rows, leaving 1,888 rows for
training and testing.
The decision to delete rows instead of imputing was driven by:
- High Proportion of Missing Data: Features like
ca and thal had sizable portions of missing values. Imputing
such extensive missing data could introduce bias and reduce the
model’s reliability.
- Maintaining Data Integrity: Deleting incomplete
rows ensured the dataset’s consistency, reducing the risk of
introducing unreliable or biased data through imputation.
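The row deletion described above can be sketched with pandas. The tiny frame below is a made-up stand-in for the merged dataset, so the values and counts are illustrative only. Note that a row missing several fields is dropped once, which is why per-column missing counts can add up to more than the number of rows removed.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the merged dataset (values are made up).
df = pd.DataFrame({
    "age":    [63, 54, 41, 67, 58, 45],
    "ca":     [0, np.nan, 1, 2, np.nan, 0],
    "thal":   [1, np.nan, 2, 3, 2, 1],
    "slope":  [2, 1, 1, 0, np.nan, 2],
    "target": [1, 0, 1, 0, 1, 0],
})

# Count missing values per column to see where the gaps are concentrated.
missing_per_column = df.isna().sum()

# Listwise deletion: drop every row that has at least one missing value.
cleaned = df.dropna(axis=0, how="any").reset_index(drop=True)

rows_dropped = len(df) - len(cleaned)
print(missing_per_column)
print(f"dropped {rows_dropped} of {len(df)} rows, {len(cleaned)} remain")
```

Applied to the figures reported in this section, the same bookkeeping gives 2,181 - 293 = 1,888 remaining rows.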
Standardization of Feature Names
The datasets used different naming conventions for features such as
trestbps, exang, ca, thal, target, and slope. To ensure consistency
across the combined dataset, we standardized all feature names to
maintain uniformity. This step was essential for proper feature
alignment during the merging process and subsequent model training
and evaluation.
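A minimal sketch of this renaming step with pandas is shown here. The alias table is hypothetical: the actual column names in each source dataset may differ, and `standardize_columns` is an illustrative helper, not part of the project's code.

```python
import pandas as pd

# Hypothetical aliases mapped to the standard names; real sources may differ.
COLUMN_ALIASES = {
    "trtbps": "trestbps",
    "exng": "exang",
    "caa": "ca",
    "thall": "thal",
    "slp": "slope",
    "output": "target",
}

def standardize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Lower-case and strip column names, then map known aliases."""
    normalized = frame.rename(columns=lambda name: name.strip().lower())
    return normalized.rename(columns=COLUMN_ALIASES)

# Two made-up sources that describe the same fields under different names.
source_a = pd.DataFrame({"age": [63], "trestbps": [145], "target": [1]})
source_b = pd.DataFrame({"Age": [54], "trtbps": [130], "output": [0]})

# After standardization the frames align column by column when concatenated.
merged = pd.concat(
    [standardize_columns(source_a), standardize_columns(source_b)],
    ignore_index=True,
)
print(merged)
```

Without the renaming, `pd.concat` would treat the differently named columns as separate fields and fill the gaps with missing values.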
Feature Selection
All 14 features were retained based on their proven significance in
predicting heart disease risk, as highlighted in multiple research
studies. These features include age, cholesterol levels, resting
blood pressure, and ECG outcomes. By retaining all features, we
ensured the models had access to sufficient information to
accurately predict heart attack risk.
Machine Learning Models
1. Decision Tree Classifier
Decision Trees (DTs) are a non-parametric supervised learning method
used for classification and regression. The objective is to build a
model that, by utilizing basic decision rules deduced from the data
features, predicts the value of a target variable (scikit-learn,
n.d.).
Best Parameters:
- Criterion: Gini impurity
- Splitter: Best
- Max Depth: 5
- Type of Pruning: Cost-complexity pruning (ccp_alpha)
- Random State: 8412 (yielded the best accuracy)
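The parameters listed above map onto scikit-learn's `DecisionTreeClassifier` roughly as follows. This is a sketch under assumptions: the data is synthetic, the 80/20 split is assumed, and the `ccp_alpha` value is a placeholder because the write-up does not state the value used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 13 predictors and a binary target (not the study's data).
X, y = make_classification(
    n_samples=500, n_features=13, n_informative=6, random_state=0
)
# The split ratio is an assumption; the write-up does not report it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(
    criterion="gini",
    splitter="best",
    max_depth=5,
    ccp_alpha=0.001,    # placeholder value, not taken from the study
    random_state=8412,
)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)
print(f"depth={tree.get_depth()}, test accuracy={accuracy:.4f}")
```

The accuracy printed here reflects the synthetic data only and is not comparable to the figures reported for the study.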
Key Metrics for Decision Tree Classifier:
- Accuracy: 92.33%
- Precision: 92.33%
- Recall: 92.33%
- F1-Score: 92.33%
The Decision Tree model’s performance exceeded expectations,
correctly classifying about 92% of high- and low-risk cases.
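For reference, the metrics reported here, together with the specificity mentioned in the abstract, can all be derived from the four cells of a binary confusion matrix. The helper below is a plain-Python sketch with made-up counts; `binary_metrics` is an illustrative name, and it assumes every denominator is non-zero.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common binary classification metrics from confusion counts."""
    precision = tp / (tp + fp)      # share of predicted positives that are correct
    recall = tp / (tp + fn)         # share of actual positives found (sensitivity)
    specificity = tn / (tn + fp)    # share of actual negatives found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Made-up counts for illustration; these are not the study's results.
metrics = binary_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

When scores are averaged across both classes with class-size weights, the averaged recall equals the accuracy by construction, which may be why the two figures coincide in the results above.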
2. Multilayer Perceptron (MLP)
The Multilayer Perceptron (MLP) is a type of artificial neural
network (ANN) that consists of multiple layers of neurons, including
an input layer, hidden layers, and an output layer (Chan, et al.,
2023).
Best Hyperparameters:
- Hidden Layers: 2 layers with 50 neurons each (after
hyperparameter tuning)
- Activation Function: Logistic (best-performing activation
function)
- Batch Size: 200 (best-performing batch size)
- Learning Rate: Constant (best-performing learning rate)
- Epochs: 1000 (optimal number of epochs)
Key Metrics for MLP Classifier:
- Accuracy: 92.33%
- Precision: 92.39%
- Recall: 92.33%
- F1-Score: 92.33%
We performed extensive hyperparameter tuning to find the best set of
parameters that yielded the highest accuracy. The MLP classifier
algorithm was adjusted to loop through various hyperparameters,
including the number of neurons, hidden layers, activation
functions, and batch sizes. The best-performing configuration
consisted of 2 hidden layers with 50 neurons each, a batch size of
200, a constant learning rate, logistic activation function, and
1000 epochs.
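A loop of this kind might look like the sketch below, using scikit-learn's `MLPClassifier`. The grid is deliberately small, the data is synthetic, and the feature scaling and validation split are assumptions rather than details taken from the project.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the study's dataset).
X, y = make_classification(
    n_samples=600, n_features=13, n_informative=6, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling is an assumption here; MLPs usually train better on scaled inputs.
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

# A reduced, illustrative grid; the project searched more combinations.
grid = {
    "hidden_layer_sizes": [(50,), (50, 50)],
    "activation": ["logistic", "relu"],
    "batch_size": [100, 200],
}

results = []
for hidden, activation, batch in product(*grid.values()):
    model = MLPClassifier(
        hidden_layer_sizes=hidden,
        activation=activation,
        batch_size=batch,
        learning_rate="constant",  # only takes effect with solver="sgd"
        max_iter=1000,             # upper bound on training epochs
        random_state=0,
    )
    model.fit(X_train, y_train)
    results.append((model.score(X_val, y_val), hidden, activation, batch))

# Keep the configuration with the highest validation accuracy.
best_score, best_hidden, best_activation, best_batch = max(
    results, key=lambda r: r[0]
)
print(best_score, best_hidden, best_activation, best_batch)
```

Selecting on a held-out validation split rather than the final test set keeps the reported test accuracy from being inflated by the search itself.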
Results & Comparison with Other Research
1. Decision Tree Performance:
Our Model: Achieved an accuracy of 92.33%, with precision, recall,
and F1-score of 0.923.
Comparison 1: A study that applied the Jellyfish Optimization
Algorithm to a Decision Tree model reported an accuracy of 97.55%
(Ahmad & Polat, 2023). Their higher accuracy suggests that the
Jellyfish Optimization Algorithm may be a better fit for this
scenario.
Comparison 2: Another study using the Particle Swarm Optimization
(PSO) technique reported an accuracy of 85.71% with a Decision Tree
(Alshraideh, et al., 2024). Our model’s 92.33% accuracy exceeds this
by 6.62 percentage points, which we attribute to the larger dataset
and careful tuning of model parameters.
2. Neural Network (ANN) Performance:
Our Model: Achieved an accuracy of 92.33%, with precision, recall,
and F1-score all around 0.923.
Comparison 1: In another study, an ANN-based model reported an
accuracy of 73.33% using the same dataset attributes (Rabbi, et al.,
2018). Our model’s accuracy is 19.00 percentage points higher, which
points to the role that dataset size and hyperparameter tuning play
in model performance. Our larger dataset, combined with optimized
ANN configurations, allowed for significantly better results.
Comparison 2: A CNN-based heart disease prediction model achieved an
accuracy of 91.71% (Arooj, et al., 2022). While CNNs are known for
their power in image and structured data classification, our simpler
ANN model slightly outperformed this with 92.33% accuracy. This
further highlights the effectiveness of dataset size and
optimization in achieving competitive results even with a relatively
simple model architecture.
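The gaps quoted in these comparisons are differences in percentage points, not relative increases. A quick arithmetic check, using only the accuracies cited in this section:

```python
# Accuracy of our models, in percent, as reported in this write-up.
ours = 92.33

# Accuracies reported by the studies cited in this section, in percent.
reported = {
    "Decision Tree + Jellyfish (Ahmad & Polat, 2023)": 97.55,
    "Decision Tree + PSO (Alshraideh et al., 2024)": 85.71,
    "ANN (Rabbi et al., 2018)": 73.33,
    "CNN (Arooj et al., 2022)": 91.71,
}

# Positive gap: our model is higher; negative gap: the cited study is higher.
gaps = {name: round(ours - accuracy, 2) for name, accuracy in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} percentage points")
```

These studies used different datasets and evaluation protocols, so the gaps are indicative rather than a controlled comparison.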
Future Work
For future work, we plan to use optimization algorithms, such as
Particle Swarm Optimization (PSO) or Genetic Algorithms, to enhance
both the Decision Tree and MLP models. Additionally, incorporating
real-time clinical data and validating the model in a live
healthcare environment would further demonstrate its applicability
in real-world scenarios.
Conclusion
Our project highlights the potential of machine learning models to
significantly improve heart attack prediction. By leveraging a large
and diverse dataset, employing rigorous preprocessing methods, and
optimizing hyperparameters, we were able to achieve high accuracy
rates. These results suggest that machine learning, when properly
tuned, can be a valuable tool in assisting healthcare professionals
with the early diagnosis of heart disease, ultimately saving lives.
Thank you for reading!
References
- Ahmad, A. & Polat, H., 2023. Prediction of Heart Disease Based
on Machine Learning Using Jellyfish Optimization Algorithm.
Diagnostics, 13(14), p. 2392.
- Alshraideh, M. et al., 2024. Enhancing Heart Attack Prediction
with Machine Learning: A Study at Jordan University Hospital.
Applied Computational Intelligence and Soft Computing.
- Anand, N., 2018. Heart Attack Prediction. [Online] Available at:
https://www.kaggle.com/datasets/imnikhilanand/heart-attack-prediction
[Accessed 10 September 2024].
- Arooj, S. et al., 2022. A Deep Convolutional Neural Network for
the Early Detection of Heart Disease. Biomedicines.
- Chan, K. Y. et al., 2023. Deep neural networks in the cloud:
Review, applications, challenges and research directions.
Neurocomputing.
- Chellammal, S. & Sharmila, R., 2019. Recommendation of
Attributes for Heart Disease Prediction using Correlation
Measure. International Journal of Recent Technology and
Engineering, 8(2S3), pp. 870–875.
- Damarla, R., 2020. Heart Disease Prediction. [Online] Available
at: https://www.kaggle.com/datasets/rishidamarla/heart-disease-prediction/data
[Accessed 10 September 2024].
- Hossain, M. I. et al., 2023. Heart disease prediction using
distinct artificial intelligence techniques: performance
analysis and comparison.
- Lapp, D., 2019. Heart Disease Dataset. [Online] Available at: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
[Accessed 10 September 2024].
- Nandal, N., 2022. heart.csv. [Online] Available at: https://figshare.com/articles/dataset/heart_csv/20236848?file=36169122
[Accessed 10 September 2024].
- Rabbi, M. F. et al., 2018. Performance Evaluation of Data Mining
Classification Techniques for Heart Disease Prediction.
Journal of Engineering Research.
- Rahman, R., 2021. Heart Attack Analysis & Prediction Dataset.
[Online] Available at: https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset/data
[Accessed 10 September 2024].
- scikit-learn, n.d. Decision Trees. [Online] Available at: https://scikit-learn.org/stable/modules/tree.html
[Accessed 14 September 2024].
- World Health Organization, 2021. Cardiovascular diseases (CVDs).
[Online] Available at: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
[Accessed 10 September 2024].