Diagnosis and Identification of Risk Factors for Heart Disease Patients Using Generalized Additive Model and Data Mining Techniques

Mukherjee, Kapoor, and Banerjee: Diagnosis and Identification of Risk Factors for Heart Disease Patients Using Generalized Additive Model and Data Mining Techniques



The heart is the most essential organ of the human body; it is a strong muscle about the size of a fist. Any disorder that affects the heart, from infection to genetic defects and blood vessel disease, is referred to as heart disease.1 Heart disease is a serious condition, and its proper diagnosis at an early stage remains a challenging task.2 In fact, up to 25% of people with heart disease have no symptoms despite insufficient blood flow to the heart, a condition referred to as silent heart disease.3 In the United States of America, about 600,000 people die of heart disease every year, which amounts to one in every four deaths.4 Diagnosis usually begins when a patient visits the doctor to have symptoms checked. Patients may present with shortness of breath, pain in the chest or back, painful and persistent coughing, or any number of other symptoms, none of which immediately alerts the doctor to a diagnosis of heart disease. Many studies on heart disease diagnosis have been carried out all over the world, generally using artificial intelligence techniques or data mining methods.5-8 The use of data mining techniques in medical diagnosis has been increasing gradually. There is no doubt that the evaluation of data taken from patients and the decisions of experts are the most important factors in diagnosis. However, artificial intelligence or machine learning techniques are sometimes used to support disease diagnosis.5-9-11

In health care, data mining or statistical machine learning plays a vital role in medical applications including diagnosis, prognosis, and therapy.12 Clinical data mining involves the conceptualization, extraction, analysis, and interpretation of the available clinical data for practical knowledge-building, clinical decision making, and practitioner reflection.12 A medical diagnosis is a classification problem.13 In predictive data mining, the data set consists of instances; each instance is characterized by attributes or features, and another special attribute represents the outcome variable or the class.14 Often, the goal of a data mining project is to build a model from the available data. Thus, data mining models are objective rather than subjective, since they are driven by the available data.

Data mining (DM) techniques15 aim at extracting high-level knowledge from raw data. There are several DM algorithms, each with its own advantages. DM techniques perform regression and classification tasks. In the case of neural networks (NNs), the backpropagation algorithm was first introduced in 197416 and later popularized in 1986.17 Since then, NNs have become increasingly used. More recently, support vector machines (SVMs) have also been proposed.18,19 Owing to their higher flexibility and nonlinear learning capabilities, both NNs and SVMs are gaining attention within the DM field, often attaining high predictive performance.20,21 SVMs present theoretical advantages over NNs, such as the absence of local minima in the learning phase. In effect, the SVM was recently considered one of the most influential DM algorithms.22 Therefore, this paper studies the SVM for heart disease diagnosis.

In the statistical analysis of clinical trials and observational studies, the identification of and adjustment for prognostic factors is an important activity for obtaining valid outcomes. The failure to consider important prognostic variables, particularly in observational studies, can lead to errors in estimating treatment differences. In addition, incorrect modeling of prognostic factors can result in the failure to identify nonlinear trends or threshold effects on survival. This article describes flexible statistical methods that may be used to identify and characterize the effect of potential prognostic factors on disease endpoints. These methods are called ‘Generalized Additive Models’ (GAM).23 Many mathematical and statistical methodologies for building classification models, from classical statistical methods to machine learning theory to classification trees, have been reviewed and compared.24-27 Much research has been devoted to better and more accurate models for the Heart Disease Dataset. One work28 gives a knowledge-driven approach. Initially, logistic regression was used by Dr. Robert Detrano for heart disease diagnosis.29 Newton Cheung utilized the C4.5, Naive Bayes, BNND and BNNF algorithms and reached classification accuracies of 81.11%, 81.48%, 81.11% and 80.96%, respectively.30 A later study proposed a method that uses an artificial immune system (AIS) and obtained higher classification accuracy than the previous works.31 Comparative results of many studies performed on this heart disease data are also available.10 In the present article, 10-fold cross-validation with 5 runs in each experiment is performed to obtain a more stable classification accuracy rate. One aim of the present article is to explore the relationship between a patient's chance of having heart disease and other biomedical parameters used as cofactors. Because of the complex relationship between the cofactors and the response variable, GAM is introduced here for better accuracy in prediction.
Another aim of this study is to find the best classifier, that is, the one that gives good performance evaluation measures, and to identify the important input variables for heart disease diagnosis using strong data mining techniques. Many authors have applied various classification techniques to this dataset for heart disease diagnosis,5-11 but SVM and MLPE have probably not been used under a proper modeling scheme. This study shows a high classification accuracy rate and presents a variable input importance chart for heart disease diagnosis.

In this research work, we used the heart disease dataset obtained from the UCI Machine Learning Repository to develop intelligent systems for the diagnosis of heart disease using data mining and GAM. The results obtained from these systems were compared, and the system with the highest recognition rate was taken as the best system for the diagnosis of heart disease. Such a system is intended to reduce the misdiagnosis of heart disease and to identify the risk factors, or important biomedical parameters, responsible for probable heart disease. This can guide doctors regarding prognostic factors and give patients greater awareness of heart disease.



The present article considers 270 patients with 14 factors or variables. The secondary data set is taken from an existing report and can be downloaded at http://archive.ics.uci.edu/ml/datasets.html. The data contain 5 continuous variables and 9 attribute characters. The description of each variable or attribute character, its levels, how it is operationalized in the present report, and summary statistics such as the mean, standard deviation, and proportion of the levels are given in Table 1. Here the presence or absence of heart disease in a patient plays the role of the dependent variable (for regression) or output variable (for classification), and the rest of the variables play the role of independent variables or cofactors.


In the present article, data mining techniques with sensitivity analysis are used for the diagnosis of heart disease and to find the factors that are most important in this diagnostic work. In addition, generalized additive logistic models are applied to find the risk factors for heart disease. For data mining, Multi-Layer Perceptron ensembles (MLPE) and support vector machines (SVM) are used for classification, and sensitivity analysis is then carried out only on the better of these two classifiers for this heart disease data set.20

The best GAM32 model can be selected through model checking criteria, namely the R-square value, the AIC or UBRE value, and regression diagnostic plots such as the normal probability plot and the plot of residuals against fitted values.14,32 Whether a cofactor is significant is judged through its p-value. For this heart disease data set, the absence or presence of heart disease is taken as the response variable (Y), and age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting ECG results, maximum heart rate achieved, exercise induced angina, oldpeak, slope of the peak exercise ST segment, number of major vessels, and thal (thallium scan) are the cofactors ($X_i$'s).

The data mining techniques aim to classify the data using different classifiers, whereas the GAM aims to identify the risk factors for this disease. Brief descriptions of the methods used are given below.

Data Mining Techniques

DM is an iterative process that consists of several steps. The CRISP-DM,33 a tool-neutral methodology supported by the industry (e.g. SPSS, DaimlerChrysler), partitions a DM project into 6 phases: 1. business understanding; 2. data understanding; 3. data preparation; 4. modeling; 5. evaluation; and 6. deployment.

This work addresses steps 4 and 5, with an emphasis on the use of NNs and SVMs to solve classification and regression goals. Both tasks require supervised learning, where a model is adjusted to a dataset of examples that map I inputs into a given target. Classification models output a probability p(c) for each possible class c, such that $\sum_{c=1}^{N_c} p(c) = 1$. To assign a target class c, one option is to set a decision threshold $D \in [0, 1]$ and then output c if p(c) > D, otherwise return $\neg c$. This method is used to build the receiver operating characteristic (ROC) curves. Another option is to output the class with the highest probability; this method allows the definition of a multi-class confusion matrix. For more details see.34

To evaluate a classification model, common metrics are35 the ROC area (AUC), the confusion matrix, accuracy (ACC), and the true positive and true negative rates (TPR/TNR). A classifier should present high values of ACC, TPR, TNR and AUC. The model’s generalization performance is often estimated by holdout validation (i.e. a train/test split) or by k-fold cross-validation.14 The latter is more robust but requires around k times more computation, since k models are fitted.
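The threshold rule and the confusion-matrix metrics described above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy labels and probabilities are invented here and are not taken from the study, and the positive class is coded as 1 by assumption.

```python
def confusion_counts(y_true, p_pos, threshold=0.5):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for y, p in zip(y_true, p_pos):
        pred = 1 if p > threshold else 0  # decision rule: positive if p(c) > D
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 0 and y == 0:
            tn += 1
        elif pred == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(y_true, p_pos, threshold=0.5):
    """Return accuracy, true positive rate and true negative rate."""
    tp, tn, fp, fn = confusion_counts(y_true, p_pos, threshold)
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"ACC": acc, "TPR": tpr, "TNR": tnr}


# Toy data, invented for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
p_pos = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
metrics = classification_metrics(y_true, p_pos, threshold=0.5)
print(metrics)
```

Sweeping the threshold from 0 to 1 and recording the TPR against 1 − TNR at each value would trace out the ROC curve.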

MLPE neural network model

In DM techniques, NN usually means the popular multilayer perceptron (MLP). A major concern in its use is the difficulty of defining the proper network for a specific application, owing to the sensitivity to initial conditions and to overfitting and underfitting problems, which limit generalization capability. A promising way to partially overcome such drawbacks is the use of MLP ensembles (MLPE); averaging and voting techniques are widely used in classical statistical pattern recognition and can be fruitfully applied to MLP classifiers. For the classification problem an MLPE, which is a combination of MLP models, is used. Each network includes one hidden layer of H neurons with logistic functions (Figure 1 (a)). The overall model is given in the form:

$$y_i = f_i\left(w_{i,0} + \sum_{j=I+1}^{I+H} f_j\left(\sum_{n=1}^{I} x_n w_{j,n} + w_{j,0}\right) w_{i,j}\right)$$

where $y_i$ is the output of the network for node i, $w_{i,j}$ is the weight of the connection from node j to node i, and $f_j$ is the activation function for node j. For a binary classification ($N_c = 2$), there is one output neuron with a logistic function. Under multi-class tasks ($N_c > 2$), there are $N_c$ linear output neurons and the softmax function is used to transform these outputs into class probabilities:

$$p(i) = \frac{\exp(y_i)}{\sum_{c=1}^{N_c} \exp(y_c)}$$

where $p(i)$ is the predicted probability and $y_i$ is the NN output for class i. The training (BFGS algorithm) is stopped when the error slope approaches zero or after a maximum number of epochs. For classification it maximizes the likelihood.14 Since NN training is not optimal, the final solution depends on the choice of starting weights. To address this issue, different networks are trained, and then either the NN with the lowest error is selected or an ensemble of all NNs is used and the average of the individual predictions is output.14 In general, ensembles are better than individual learners.36 The final NN performance depends crucially on the number of hidden nodes. The simplest NN has H = 0, while more complex NNs use a high H value.
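As a rough illustration of the softmax transform and of ensemble averaging, the following Python sketch converts the linear outputs of several networks into class probabilities and averages them. The three sets of network outputs are hypothetical numbers chosen for the example, not results from the study, and the function names are our own.

```python
import math


def softmax(outputs):
    """Turn linear network outputs into class probabilities that sum to 1."""
    shift = max(outputs)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(y - shift) for y in outputs]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_average(prob_lists):
    """Average the class probabilities predicted by the ensemble members."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]


# Hypothetical linear outputs of three trained networks for one patient
network_outputs = [[1.2, -0.3], [0.4, 0.1], [2.0, -1.0]]
member_probs = [softmax(o) for o in network_outputs]
ensemble_probs = ensemble_average(member_probs)

# Output the class with the highest averaged probability
predicted_class = max(range(len(ensemble_probs)), key=lambda c: ensemble_probs[c])
print(ensemble_probs, predicted_class)
```

Averaging probabilities rather than hard votes keeps a continuous output, which is what a threshold-based ROC analysis needs.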

Support Vector Machine (SVM) model

When compared with NNs, SVMs present theoretical advantages, such as the absence of local minima in the learning phase.14 The basic idea is to transform the input $x \in \mathbb{R}^I$ into a high m-dimensional feature space by using a nonlinear mapping. Then, the SVM finds the best linear separating hyperplane, related to a set of support vector points, in the feature space (Figure 1 (b)). The transformation ($\varphi(x)$) depends on a kernel function.

Here, the SVM uses the sequential minimal optimization (SMO) learning algorithm with the popular Gaussian kernel, which has fewer parameters than other kernels (e.g. polynomial): $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, $\gamma > 0$. The classification performance is affected by two hyperparameters: $\gamma$, the parameter of the kernel, and C, a penalty parameter. The probabilistic SVM output is given by37

$$f(x_i) = \sum_{j=1}^{m} y_j \alpha_j K(x_j, x_i) + b, \qquad p(i) = \frac{1}{1 + \exp(A f(x_i) + B)}$$

where m is the number of support vectors, $y_i \in \{-1, 1\}$ is the output for a binary classification, b and $\alpha_j$ are coefficients of the model, and A and B are determined by solving a regularized maximum likelihood problem. When $N_c > 2$, the one-against-one approach is used, which trains $N_c(N_c - 1)/2$ binary classifiers, and the output is given by a pairwise coupling.37
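To make the roles of the Gaussian kernel and of the constants A and B more concrete, here is a minimal Python sketch of a kernel decision value followed by a sigmoid mapping to a probability. The support vectors, coefficients, and the values of gamma, A and B below are arbitrary assumptions for illustration; in practice they come out of SVM training and the regularized likelihood fit.

```python
import math


def gaussian_kernel(x, x_prime, gamma):
    """Gaussian kernel exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, labels, alphas, b, gamma):
    """Weighted sum of kernel evaluations against the support vectors, plus b."""
    total = 0.0
    for sv, y, alpha in zip(support_vectors, labels, alphas):
        total += y * alpha * gaussian_kernel(sv, x, gamma)
    return total + b


def sigmoid_probability(f_value, A, B):
    """Map a decision value to a probability with a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(A * f_value + B))


# Hypothetical fitted quantities (not from the paper)
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
labels = [-1, 1]
alphas = [1.0, 1.0]
b, gamma = 0.0, 0.5
A, B = -2.0, 0.0

x_new = [1.0, 1.0]
f_new = decision_value(x_new, support_vectors, labels, alphas, b, gamma)
p_new = sigmoid_probability(f_new, A, B)
print(f_new, p_new)
```

A negative A makes larger decision values map to higher probabilities for the class labelled +1.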

Sensitivity Analysis

Sensitivity analysis is a simple procedure that is applied after training and analyzes the model responses when a given input is changed. Let $y_{a,j}$ denote the output obtained by holding all input variables at their average values except $x_a$, which varies through its entire range ($x_{a,j}$, with $j \in \{1, 2, \ldots, L\}$ levels). The variance ($V_a$) of $y_{a,j}$ is used as a measure of input relevance.38 If $N_c > 2$ (multi-class), $V_a$ is set to the sum of the variances for each output class probability ($p(c)_{a,j}$). A high variance ($V_a$) suggests a high relevance of $x_a$; thus the input relative importance ($R_a$) is given by:

$$R_a = \frac{V_a}{\sum_{i=1}^{I} V_i} \times 100\ (\%)$$

For a more detailed analysis, Cortez et al. proposed the variable effect characteristic (VEC) curve, which plots the $x_{a,j}$ values (x-axis) versus the $y_{a,j}$ predictions (y-axis).39
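The procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the "model" is a hypothetical logistic function with made-up weights standing in for a trained classifier, the input rows are invented, and the number of levels L is set to 6 arbitrarily.

```python
import math


def sensitivity_variances(model, inputs, levels=6):
    """Vary one input over its range with the others held at their means,
    and return the variance of the model response for each input."""
    n_inputs = len(inputs[0])
    means = [sum(row[a] for row in inputs) / len(inputs) for a in range(n_inputs)]
    variances = []
    for a in range(n_inputs):
        low = min(row[a] for row in inputs)
        high = max(row[a] for row in inputs)
        responses = []
        for j in range(levels):
            x = list(means)
            x[a] = low + (high - low) * j / (levels - 1)  # j-th level of input a
            responses.append(model(x))
        mean_response = sum(responses) / levels
        variances.append(sum((r - mean_response) ** 2 for r in responses) / (levels - 1))
    return variances


def relative_importance(variances):
    """Express each variance as a percentage of the total variance."""
    total = sum(variances)
    if total == 0:
        return [0.0 for _ in variances]
    return [100.0 * v / total for v in variances]


def toy_model(x):
    """Hypothetical stand-in for a trained classifier's predicted probability."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] + 0.5 * x[1] + 0.0 * x[2])))


# Invented input rows (three inputs each), for illustration only
inputs = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.5, 0.2, 0.8], [0.2, 0.9, 0.4]]
importance = relative_importance(sensitivity_variances(toy_model, inputs))
print(importance)
```

The list of responses collected for a single input, plotted against the levels of that input, is exactly the data behind a VEC curve.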

Generalized Additive Model (GAM)

GAM32,-40 is an extension of the Generalized Linear Model (GLM)41 in which the modeling of the mean function relaxes the assumption of linearity, although additivity of the mean function in the covariates is still assumed. While the mean functions of some covariates may be assumed to be linear, the non-linear mean functions are modeled using smoothing methods, such as kernel smoothers, lowess, smoothing splines or regression splines. In general, the model has the following structure

$$g(\mu) = \beta_0 + \sum_{j=1}^{p} f_j(X_j)$$

where $\mu = E(Y)$ for a response variable Y with some exponential family distribution, g is the link function, $\beta_0$ is an intercept, and the $f_j$ are smooth functions of the covariates $X_j$ for each $j = 1, 2, \ldots, p$.

GAMs provide more flexibility than GLMs, as they relax the hypothesis of linear dependence between the covariates and the expected value of the response variable. The main difficulty with GAMs lies in the estimation of the smooth functions $f_j$, and there are different ways to address this. One of the most common alternatives is based on splines, which allow the GAM estimation to be reduced to the GLM context.42 Smoothing splines43 use as many knots as there are unique values of the covariate $X_j$ and control the model’s smoothness by adding a penalty to the least squares fitting objective.44,45
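For a single covariate, the penalized objective mentioned above is commonly written as follows. This is the textbook cubic smoothing spline criterion, given here only as an illustration; the symbols n (number of observations) and λ (smoothing parameter) are notation introduced for this sketch and do not appear in the original text.

```latex
\min_{f} \; \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2
  + \lambda \int \bigl[f''(t)\bigr]^2 \, dt,
\qquad \lambda \ge 0
```

With λ = 0 the fit can interpolate the data, while a very large λ forces the second derivative towards zero and the fit approaches a straight line, that is, the linear term of a GLM.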

Generalized additive models can be used in virtually any setting where linear models are used. For a single observation (the ith), the basic idea is to replace $\sum_{j=1}^{p} x_{ij}\beta_j$, the linear component of the model, with an additive component $\sum_{j=1}^{p} f_j(x_{ij})$.

In the logistic regression model the outcome $y_i$ is ‘0’ or ‘1’, with ‘1’ indicating an event and ‘0’ indicating no event. (In this article ‘1’ indicates absence of heart disease and ‘0’ indicates presence of heart disease in the patient.) The generalized additive logistic model then assumes that the log-odds are given by

$$\log\left[\frac{P(y_i = 1 \mid x_{i1}, \ldots, x_{ip})}{1 - P(y_i = 1 \mid x_{i1}, \ldots, x_{ip})}\right] = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij})$$

where $\beta_0$ is an intercept and $f_1, f_2, \ldots, f_p$ are the smooth functions, which are estimated by a spline algorithm. For more details see these references.23-32

Performance Evaluation Measures
Classification Accuracy (ACC)

Classification accuracy refers to the ability of the model to correctly predict the class label of new or previously unseen data. Classification accuracy is the percentage (%) of test set examples correctly classified by the classifier. The quality of classification can be assessed through the overall accuracy, that is

$$ACC(T) = \frac{\sum_{i=1}^{|T|} assess(t_i)}{|T|}, \qquad assess(t) = \begin{cases} 1, & \text{if } classify(t) = t.c \\ 0, & \text{otherwise} \end{cases}$$

where T is the set of data items to be classified (the test set in this case), $t \in T$, $t.c$ is the class of item t, and $classify(t)$ returns the classification of t by the classifier used (here, SVM and MLPE). For more details see.46

Area under Curve (AUC)

AUC is a common evaluation metric for binary classification problems. Consider a plot of the true positive rate against the false positive rate as the threshold value for classifying an item as 0 or 1 is increased from 0 to 1. If the classifier is very good, the true positive rate will increase quickly and the area under the curve will be close to 1. One characteristic of the AUC is that it is independent of the fraction of the test population that is class 0 or class 1; this makes the AUC useful for evaluating the performance of classifiers on unbalanced data sets.
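One way to see why the AUC does not depend on the class proportions is to compute it by pairwise comparison: it equals the fraction of (positive, negative) pairs in which the positive item receives the higher score, with ties counted as one half. The Python sketch below uses this equivalent formulation; the labels and scores are toy values invented for illustration, and the positive class is coded as 1 by assumption.

```python
def auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


# Toy labels and scores, invented for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
print(auc(y_true, scores))
```

The double loop is quadratic in the number of examples, which is fine for a data set of a few hundred patients; rank-based formulas give the same value faster.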

k-fold Cross Validation

k-fold cross validation is a common technique for estimating the performance of a classifier. Given a set of m training examples, a single run of k-fold cross validation proceeds as follows:

  1. Arrange the training examples in a random order.

  2. Divide the training examples into k folds (k chunks of approximately m/k examples each).

  3. For i = 1, 2, …, k:

    • (i) Train the classifier using all the examples that do not belong to fold i.

    • (ii) Test the classifier on all the examples in fold i.

    • (iii) Compute $n_i$, the number of examples in fold i that were wrongly classified.

  4. Return the following estimate of the classifier error:

$$E = \frac{\sum_{i=1}^{k} n_i}{m}$$

To obtain an accurate estimate of the accuracy of a classifier, k-fold cross validation is run several times, each with a different random arrangement in Step 1. The results of the individual runs are then averaged to produce the final classification accuracy. For more details see.14
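The steps above, including the repetition over several runs, can be sketched in Python as follows. The nearest-mean classifier and the one-dimensional toy data are hypothetical stand-ins chosen only to keep the example self-contained; they are not the SVM or MLPE models used in this study.

```python
import random


def train_nearest_mean(train_x, train_y):
    """Hypothetical stand-in classifier: predict the class whose training mean is closest."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def k_fold_error(examples, labels, train_fn, k, rng):
    """One run of k-fold cross validation; returns the estimated error rate."""
    m = len(examples)
    order = list(range(m))
    rng.shuffle(order)                       # step 1: random arrangement
    folds = [order[i::k] for i in range(k)]  # step 2: k folds of about m/k examples
    wrong = 0
    for fold in folds:                       # step 3
        held_out = set(fold)
        train_idx = [t for t in order if t not in held_out]
        predict = train_fn([examples[t] for t in train_idx],
                           [labels[t] for t in train_idx])
        wrong += sum(1 for t in fold if predict(examples[t]) != labels[t])
    return wrong / m                         # step 4: total errors divided by m


def repeated_k_fold_accuracy(examples, labels, train_fn, k=10, runs=5, seed=0):
    """Average accuracy over several runs, each with a new random arrangement."""
    rng = random.Random(seed)
    errors = [k_fold_error(examples, labels, train_fn, k, rng) for _ in range(runs)]
    return 1.0 - sum(errors) / runs


# Toy, well-separated one-dimensional data (invented for illustration)
examples = [float(v) for v in range(10)] + [float(v) for v in range(20, 30)]
labels = [0] * 10 + [1] * 10
accuracy = repeated_k_fold_accuracy(examples, labels, train_nearest_mean, k=5, runs=5)
print(accuracy)
```

Fixing the seed makes the random arrangements reproducible, which is useful when two classifiers are to be compared on the same folds.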

All GAM regression and data mining analyses were performed in the R statistical software with the appropriate library packages40-47 (http://www3.dsi.uminho.pt/pcortez/rminer.html).34


Table 2 presents the summarized results of the Generalized Additive Model used for heart disease diagnosis. Here the response variable is whether or not a patient has heart disease, and the rest of the variables are cofactors. GAM estimation has two parts: parametric estimation for those cofactors that enter the model parametrically, and non-parametric estimation for the smoothed cofactors. In the present article only Age is a smoothed cofactor; the rest are estimated parametrically. The detailed results and interpretations of Table 2 (fitted binomial model with logit link) are described as follows. The GAM regression coefficients give the change in the log odds of heart disease (response) for a one unit increase in the cofactor (predictor). Here we have considered p-values up to approximately the 10% level as significant, and those from more than 10% to approximately 20% as partially significant.40,41-49,50

Results of Estimation of Parametric coefficients

Heart disease (HD) is very strongly and positively associated with the chest pain type of a patient. Of the four types of chest pain, asymptomatic chest pain changes the log odds of HD by 2.7777 with p-value 0.0008. Therefore, a patient has a higher chance of HD if he/she has asymptomatic chest pain.

Table 1

Operationalization of variables with the analysis & summarized statistics

| Variable name | Operationalization | Mean | Standard deviation | Proportion of levels of attributes |
|---|---|---|---|---|
| Age (Year) | Age at study | 54.43 | 9.10 | --- |
| Sex | Gender (Female = 1; Male = 2) | --- | --- | 1 = 32.22%; 2 = 67.78% |
| Chest Pain | Chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic) | --- | --- | 1 = 7.41%; 2 = 15.56%; 3 = 29.26%; 4 = 47.78% |
| Resting BP | Resting blood pressure (in mm Hg on admission to the hospital) | 131.34 | 17.86 | --- |
| Cholesterol | Serum cholesterol in mg/dl | 249.66 | 51.69 | --- |
| Fasting BS | Fasting blood sugar > 120 mg/dl (1 = False; 2 = True) | --- | --- | 1 = 85.19%; 2 = 14.81% |
| Resting ECG | Resting electrocardiographic results (1 = Normal; 2 = Having ST-T; 3 = Hypertrophy) | --- | --- | 1 = 48.52%; 2 = 0.74%; 3 = 50.74% |
| Max HR | Maximum heart rate achieved | 149.68 | 23.17 | --- |
| Exercise Ang | Exercise induced angina (1 = No; 2 = Yes) | --- | --- | 1 = 67.04%; 2 = 32.96% |
| Oldpeak | ST depression induced by exercise relative to rest | 1.05 | 1.14 | --- |
| Slope | The slope of the peak exercise ST segment (1 = Up sloping; 2 = Flat; 3 = Down sloping) | --- | --- | 1 = 48.15%; 2 = 45.19%; 3 = 6.67% |
| Vessel | Number of major vessels (0-3) colored by fluoroscopy (treated as a discrete variable) | --- | --- | 0 = 59.26%; 1 = 21.48%; 2 = 12.22%; 3 = 7.04% |
| Thal | Thallium heart scan (1 = Normal; 2 = Fixed defect; 3 = Reversible defect) | --- | --- | 1 = 56.30%; 2 = 5.19%; 3 = 38.52% |
| Heart disease | Diagnosis of heart disease (1 = Absence; 2 = Presence) | --- | --- | 1 = 55.56%; 2 = 44.44% |

Table 2

Results for GAM of Heart disease data analysis using Binomial distribution with ‘logit’ link

Estimation of Parametric coefficients

| Covariates | Estimate | Standard Error | Z value | p-value |
|---|---|---|---|---|
| Intercept | -6.644423 | 2.600914 | -2.555 | 0.010629 * |
| Chest Pain 2 | 1.498281 | 0.963307 | 1.555 | 0.119862 |
| Chest Pain 3 | 0.662778 | 0.824066 | 0.804 | 0.421237 |
| Chest Pain 4 | 2.777748 | 0.829641 | 3.348 | 0.000814 *** |
| Cholesterol | 0.009850 | 0.004513 | 2.183 | 0.029053 * |
| Max. HR | -0.032619 | 0.011325 | -2.880 | 0.003974 ** |
| Old peak | 0.515073 | 0.223007 | 2.310 | 0.020906 * |
| Resting BP | 0.024378 | 0.011871 | 2.053 | 0.040025 * |
| Resting ECG 2 | 2.187153 | 3.543705 | 0.617 | 0.537107 |
| Resting ECG 3 | 0.768672 | 0.439692 | 1.748 | 0.080429 . |
| Sex 2 | 2.080282 | 0.624856 | 3.329 | 0.000871 *** |
| Thal 2 | 0.063903 | 0.845742 | 0.076 | 0.939771 |
| Thal 3 | 1.693988 | 0.477088 | 3.551 | 0.000384 *** |

Approximate Significance of smooth terms (Non-parametric)

| Smooth Covariate | Edf | Ref. df | Chi.sq | p-value |
|---|---|---|---|---|

Edf: Estimated degrees of freedom; Ref. df: Degrees of freedom before smoothing; Chi.sq: Chi square value.

Significance Level: ‘***’ 0.001; ‘**’ 0.01; ‘*’ 0.05; ‘.’ 0.1.

R-sq.(adj) = 0.697; Deviance explained = 64.3%; UBRE (Unbiased risk estimator) = -0.24238

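Table 3

Results of ACC and AUC for the heart disease dataset by 10-fold cross validation in 5 runs

ACC: Classification accuracy rate in %; AUC: Area under curve in 0-1.

| Method | ACC 1st run | ACC 2nd run | ACC 3rd run | ACC 4th run | ACC 5th run | ACC Average | AUC 1st run | AUC 2nd run | AUC 3rd run | AUC 4th run | AUC 5th run | AUC Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|

SVM: Support vector machine; MLPE: Multilayer perceptron ensembles.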

In the GAM fitted model, for every one unit change in Cholesterol the log odds of HD increase by 0.0099 with p-value 0.029. Cholesterol has a significant positive association with HD, which indicates that patients with high Cholesterol have a higher chance of HD.

HD is strongly and negatively associated with the maximum heart rate (Max. HR) of a patient. For every one unit change in Max. HR the log odds of HD decrease by 0.0326 with p-value 0.004. That means patients with a higher maximum heart rate have a lower risk of HD.

For a one unit change in Old peak the log odds of HD increase by 0.5151 with p-value 0.021. HD is significantly and positively associated with Old peak. Therefore patients with a high Old peak value have a higher risk of HD.

In this GAM fitted model, for every one unit change in Resting BP the log odds of HD increase by 0.0244 with p-value 0.040. Resting BP has a significant positive association with HD, which indicates that patients with high Resting BP have a higher chance of HD.

Heart disease (HD) is positively associated with the Resting ECG result of a patient. Of the three types of Resting ECG result, Hypertrophy changes the log odds of HD by 0.7687 with p-value 0.080. Therefore patients have a higher chance of HD if their Resting ECG result shows Hypertrophy than otherwise.

The sex (gender) of a patient has a very significant positive association with HD. Being male changes the log odds of HD by 2.0803, with p-value <0.001, relative to being female. This indicates that male patients have a higher chance of HD.

HD has a very significant positive association with the Thallium heart scan (Thal) result. A Reversible defect in a patient's thallium heart scan report changes the log odds of HD by 1.6940 with p-value <0.001. This means that a patient has a higher chance of HD if his/her thallium heart scan report shows a Reversible defect than otherwise.

The number of major vessels (Vessel), treated as a discrete variable in this GAM fitted model, has a very significant positive association with HD. Every increase of one in Vessel causes an increment of 1.2636 in the log odds of HD with p-value <0.001.

Results of Non-parametric estimation for approximate significance of Smooth term

In this GAM fitted model only one cofactor, namely Age, is used as a smoothing factor. As this is a nonparametric method of estimation, the Chi-square test statistic has been used for testing the hypothesis. From Table 2 it is observed that the smooth term for the cofactor Age is partially significant with p-value 0.0957.

It is also noticed from Table 2 that the GAM fitted model has an adjusted R-square value of about 0.70 with about 64% of the deviance explained. The UBRE (Unbiased risk estimator) score is -0.2424, which is very low compared with other models.

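From Table 2, the final selected GAM fitted binary logistic model of heart disease (y) is shown below

$$\begin{aligned} \log \text{odds}(HD) = {} & -6.64 + 1.50\,\text{ChestPain2} + 0.66\,\text{ChestPain3} + 2.78\,\text{ChestPain4} + 0.0099\,\text{Cholesterol} - 0.03\,\text{Max.HR} \\ & + 0.52\,\text{Oldpeak} + 0.02\,\text{RestingBP} + 2.19\,\text{RestingECG2} + 0.77\,\text{RestingECG3} + 2.08\,\text{Sex2} \\ & + 0.06\,\text{Thal2} + 1.69\,\text{Thal3} + 1.26\,\text{Vessel} + f(\text{Age}) \end{aligned}$$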

In the above predictive formula, all the cofactors except Age enter the additive model parametrically. Age is the only smoothing term, and its approximate significance has been judged through a non-parametric method (Chi-square test).
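Because the parametric coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The Python sketch below does this for some of the estimates in Table 2. It is an interpretive aid rather than part of the original analysis, and it assumes that all other cofactors, including the smooth term f(Age), are held fixed.

```python
import math

# Parametric estimates copied from Table 2 (log-odds scale)
coefficients = {
    "Chest Pain 4": 2.777748,
    "Cholesterol": 0.009850,
    "Max. HR": -0.032619,
    "Old peak": 0.515073,
    "Resting BP": 0.024378,
    "Sex 2": 2.080282,
    "Thal 3": 1.693988,
}


def odds_ratio(beta, delta=1.0):
    """Multiplicative change in the odds for a change of `delta` units,
    with all other cofactors (and the smooth term in Age) held fixed."""
    return math.exp(beta * delta)


odds_ratios = {name: odds_ratio(beta) for name, beta in coefficients.items()}

# Effect of a 10-unit higher maximum heart rate on the odds of HD
or_max_hr_10 = odds_ratio(coefficients["Max. HR"], delta=10.0)

for name, value in odds_ratios.items():
    print(f"{name}: odds ratio {value:.3f}")
print(f"Max. HR (+10 units): odds ratio {or_max_hr_10:.3f}")
```

For example, the estimate for asymptomatic chest pain corresponds to odds of HD roughly 16 times those of the reference chest pain type, and a negative coefficient such as that of Max. HR gives an odds ratio below 1.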

In Figures 2 and 3, the GAM diagnostic plots are examined for the binomial logit model. Figure 2(a) shows the histogram of the residuals for the binomial logit GAM, which indicates that the residuals are approximately normally distributed. Figure 2(b) shows the plot of the smooth term for the cofactor Age with its confidence belt. It shows that the effect of Age is non-linear.

In Figure 3(a), the absolute residual values are plotted against the fitted values of the GAM. This residual plot is essentially flat, indicating that the variance is constant with respect to the mean. Figure 3(b) shows the normal probability plot for the fitted model, which reveals no systematic departure and therefore no evident lack of fit with respect to the response distribution, the variables, or outliers.

Results of Data Mining Techniques

Table 3 presents the results of the data mining techniques for heart disease diagnosis. Two classification methods, SVM and MLPE, are used for diagnosis. Two performance measures, namely the classification accuracy rate (ACC) and the area under the curve (AUC), are checked using 10-fold cross validation with 5 runs in each experiment. It is observed from Table 3 that SVM is superior to MLPE on both performance measures. After 10-fold cross validation with 5 runs, the average ACC value for SVM is almost 85%, whereas MLPE shows an accuracy rate of 82%. For AUC, SVM and MLPE show almost 0.90 and 0.86 respectively.

In Figure 4, the plots from the sensitivity analysis under SVM are shown. Figure 4(a) shows the input importance bar chart for heart disease diagnosis. Maximum heart rate is the most important input variable for heart disease diagnosis under SVM (the best classifier of the data mining techniques considered). Figure 4(b) shows the variable effective curve (VEC) for Max HR, which is decreasing; the results from Table 2 also suggest this.


The current article considers heart disease (HD), that is, whether or not a patient has heart disease, as the response variable. It is a binary variable with values ‘1’ and ‘2’, which stand for absence and presence of heart disease respectively. HD has been modeled using a generalized additive model. The GAM fitted model results are displayed in Table 2.

Figure 1

Data Mining Techniques: (a) Multi-Layer Perceptron Neural Network (MLPE); (b) Support Vector Machine (SVM)

Figure 2(a)

Histogram of residuals.

Figure 2(b)

Smoothing term (Age) plot with confidence belt.

Figure 3(a)

Absolute residual plot.

Figure 3(b)

Normal probability plots of residuals.

Figure 4(a)

Input Importance Chart.

Figure 4(b)

Variable effective curve for Max. HR (most important input variable).


The currently reported results (Table 2), though not completely conclusive, are revealing. The determinants of HD are derived so as to satisfy the following regression analysis criteria. First, the determinants are selected based on GAM fitted model analyses. Second, the final model is selected based on UBRE.40-47 Third, the final model is justified by the GAM diagnostic plots. Fourth, the standard errors of the estimates are very small, indicating that the estimates are stable.48

Fifth, the final model of HD is selected after locating the appropriate statistical distribution. The HD distribution is identified herein as the binomial distribution. For further details please follow the references.49,50

To the best of our knowledge, the present models (Results & interpretation section) can be considered a sound first building block of a regression analysis. The current models may provide better assistance for treatment decision making by using the individual patient's risk factors and the benefits of a specific treatment. The current results point to several interesting conclusions, and these findings may help medical practitioners provide better medical treatment. The thallium scan report and the chest pain type are highly important for the identification of heart disease patients. Male patients in particular are recommended to take care of their heart at an older age.


We would like to acknowledge all the previous authors who have worked on this data set, and also the UCI Machine Learning Repository for making this dataset available. Finally, we are very thankful to the reviewers for their valuable comments, which improved this article.


Conflicts of interest: None declared.



SVM: Support vector machine
MLPE: Multi layer perceptron ensemble
MLP: Multi layer perceptron
GAM: Generalized additive model
HD: Heart disease
DM: Data mining
VEC: Variable effective curve



Pampel FC, Pauley S , authors. Progress against heart disease. Greenwood Publishing Group; 2004


Lahsasna A, Ainon RN, Zainuddin R, Bulgiba AM , authors. A transparent fuzzy rule-based clinical decision support system for heart disease diagnosis. Knowledge Technology. 2012;295(2):62–71


Yaron G , author. Symptoms and Complications of Heart_Disease. [Online]: www.itamar-medical.comPatient_Information/Cardio_101/.


Bhasin M, Raghava GP , authors. Analysis and prediction of affinity of TAP binding peptides using cascade SVM. Protein Science. 2004;13(3):596–607


Parthiban G, Srivatsa SK , authors. Applying machine learning methods in diagnosing heart disease for diabetic patients. International Journal of Applied Information Systems (IJAIS). 2012;3:2249–0868


Prerana THM, Shivaprakash NC, Swetha N , authors. Prediction of Heart Disease Using Machine Learning Algorithms- Naïve Bayes, Introduction to PAC Algorithm, Comparison of Algorithms and HDPS. International Journal of Science and Engineering. 2015;3(2):90–9


Ghumbre S, Patil C, Ghatol A. Heart Disease Diagnosis using Support Vector Machine. International Conference on Computer Science and Information Technology (ICCSIT’2011), Pattaya; Dec 2011.


Rajkumar A, Reena GS. Diagnosis of heart disease using datamining algorithm. Global Journal of Computer Science and Technology. 2010;10(10):38–43.


Olaniyi EO, Oyedotun OK, Adnan K. Heart diseases diagnosis using neural networks arbitration. International Journal of Intelligent Systems and Applications. 2015;7(12):72.


Khanna D, Sahu R, Baths V, Deshpande B. Comparative Study of Classification Techniques (SVM, Logistic Regression and Neural Networks) to Predict the Prevalence of Heart Disease. International Journal of Machine Learning and Computing. 2015;5(5):414.


Mythili T, Mukherji D, Padalia N, Naidu A. A heart disease prediction model using SVM-Decision Trees-Logistic Regression (SDL). International Journal of Computer Applications. 2013;68(16).


Shomona GJ, Ramani GR. Data Mining in Clinical Data Sets: A Review. International Journal of Applied Information Systems. 2012;4(6):15–26.


Saidi M, Chikh MA, Settouti N. Automatic identification of diabetes diseases using a modified artificial immune recognition system2 (MAIRS2). In: Proceedings of the 3ème conférence internationale sur l’informatique et ses applications; 2011.


Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. International Journal of Medical Informatics. 2008;77(2):81–97.


Witten IH, Frank E. Data mining: practical machine learning tools and techniques. Morgan Kaufmann; 2005.


Werbos PJ. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Doctoral dissertation, Applied Mathematics. Harvard University, MA; 1974.


Rumelhart D, Hinton G, Williams R. Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1, chapter 8. Cambridge: MIT Press; 1986. p. 318–362.


Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory; 1992 Jul 1; ACM. p. 144–152.


Smola AJ, Schölkopf B. A tutorial on support vector regression. Statistics and Computing. 2004;14(3):199–222.


Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. NY, USA: Springer-Verlag; 2008.


Huang Z, Chen H, Hsu CJ, Chen WH, Wu S. Credit rating analysis with support vector machines and neural networks: a market comparative study. Decision Support Systems. 2004;37(4):543–58.


Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou ZH. Top 10 algorithms in data mining. Knowledge and Information Systems. 2008;14(1):1–37.


Hastie T, Tibshirani R. Generalized additive models for medical research. Statistical Methods in Medical Research. 1995;4:187–196.


Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002;97(457):77–87.


Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis. 2005;48(4):869–85.


Li T, Zhang C, Ogihara M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004;20(15):2429–37.


Liao JG, Chin KV. Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics. 2007;23(15):1945–51.


Nahar J, Imam T, Tickle KS, Chen YP. Computational intelligence for heart disease diagnosis: A medical knowledge driven approach. Expert Systems with Applications. 2013;40(1):96–104.


Detrano R, Janosi A, Steinbrunn W, Pfisterer M, Schmid JJ, Sandhu S, et al. International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology. 1989;64(5):304–10.


Cheung N. Machine learning techniques for medical analysis. B.Sc. thesis. School of Information Technology and Electrical Engineering, University of Queensland; 2001.


Polat K, Sahan S, Kodaz H, Günes S. A new classification method to diagnosis heart disease: Supervised artificial immune system (AIRS). In: Proceedings of the Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN); 2005.


Hastie T, Tibshirani R. Generalized additive models. John Wiley & Sons, Inc.; 1990.


Chapman P, Clinton J, Kerber R, Khabaza T, Reinartz T, Shearer C, Wirth R. CRISP-DM 1.0: Step-by-step data mining guide. CRISP-DM consortium; 2000.


Cortez P. A tutorial on the rminer R package for data mining tasks. Teaching Report. Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimarães, Portugal; 2015.


Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. San Francisco, CA: Morgan Kaufmann; 2005.


Rocha M, Cortez P, Neves J. Evolution of Neural Networks for Classification and Regression. Neurocomputing. 2007;70:2809–16.


Wu TF, Lin CJ, Weng RC. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research. 2004;5:975–1005.


Kewley RH, Embrechts MJ, Breneman C. Data strip mining for the virtual design of pharmaceuticals with neural networks. IEEE Transactions on Neural Networks. 2000;11(3):668–79.


Cortez P, Teixeira J, Cerdeira A, Almeida F, Matos T, Reis J. Using Data Mining for Wine Quality Assessment. In: Discovery Science. 2009;5808:66–79.


Wood SN. Generalized Additive Models: An Introduction with R. London: Chapman and Hall; 2006.


Myers RH, Montgomery DC, Vining GG, Robinson TJ. Generalized linear models: with applications in engineering and the sciences. John Wiley & Sons; 2012.


Currie ID, Durban M, Eilers PH. Generalized linear array models with applications to multidimensional smoothing. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006;68(2):259–80.


Green PJ, Silverman BW. Nonparametric regression and generalized linear models: a roughness penalty approach. CRC Press; 1993.


Ruppert D. Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics. 2002;11(4):735–57.


Eilers PH, Marx BD. Flexible smoothing with B-splines and penalties. Statistical Science. 1996;1:89–102.


Watkins A. AIRS: a resource limited artificial immune classifier. Master's thesis. Mississippi State University; 2001.


Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. 1st ed. New York: Cambridge University Press; 2003.


Chatterjee S, Hadi AS. Regression Analysis by Example. 5th ed. New Jersey: John Wiley & Sons; 2006.


Das RN, Mukherjee S, Panda RN. Association between Body Mass Index and Cardiac Parameters of Worcester Heart Attack Study. BAOJ Cell Mol Cardio. 2016;2:006.


Das RN, Mukherjee S. Joint Mean-Variance Overall Survival Time Fitted Models from Stage III Non-Small Cell Lung Cancer. Epidemiology (Sunnyvale). 2017;7:296.