Epilepsy is a disorder of the central nervous system (CNS) that affects about 1.2% of the US population (3.4 million people) and more than 65 million people globally. About 1 in 26 people will develop epilepsy at some point during their lifetime. There are many kinds of seizures, each with different symptoms, such as loss of consciousness, jerking movements, or confusion. Some seizures are much harder to detect visually; the patient may simply stop responding or stare blankly for a brief period. Seizures can happen unexpectedly and can result in injuries from falling or biting the tongue, or in loss of bladder or bowel control. These are some of the reasons why seizure detection is of utmost importance for patients under medical supervision who are suspected to be seizure prone. This project uses binary classification methods to predict whether an individual is having a seizure.
The dataset is available on UCI’s machine learning repository here. It includes 4097 electroencephalogram (EEG) readings per patient over 23.5 seconds, with 500 patients in total. The 4097 data points were then divided into 23 equal chunks per patient, and each chunk became one row in the dataset. Each row contains 178 readings stored as columns; in other words, 178 columns make up one second of EEG readings. All in all, there are 11,500 rows and 180 columns: the first column is the patient ID and the last column is the status of the patient, which tells us whether the patient is having a seizure.
In this project, I will demonstrate the steps to building a binary classification machine learning algorithm in Python.
The Jupyter Notebook is available on my GitHub.
The dataset contains a hashed patient ID column, 178 EEG readings covering one second, and an output variable y describing the status of the patient during that second. When a patient is having a seizure, y is 1; all other values denote statuses we aren’t interested in. So when we turn y into a binary variable (1 for a seizure, 0 for everything else), this becomes a binary classification problem.
We will also drop the first column, since the patient ID is hashed and there’s no way for us to use it. We use the following code to do all of that.
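Here is a minimal sketch of that step, assuming the data sits in a pandas DataFrame whose first column is the hashed ID and whose label column is named y. The tiny synthetic frame and the names patient_id, X1 to X4, and OUTPUT_LABEL are stand-ins for illustration, not the real file.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real file (assumed layout: ID, readings, y).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(-200, 200, size=(10, 4)),
                  columns=["X1", "X2", "X3", "X4"])
df.insert(0, "patient_id", [f"id_{i}" for i in range(10)])
df["y"] = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

# Seizure (y == 1) becomes the positive class; every other status becomes 0.
df["OUTPUT_LABEL"] = (df["y"] == 1).astype(int)

# Drop the original label and the hashed ID (the first column).
df = df.drop(columns=["y", df.columns[0]])
print(df.head())
```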
The next step is to calculate the prevalence rate, defined as the proportion of samples that belong to the positive class; in our dataset, it is the proportion of rows (one-second EEG chunks) in which the patient is having a seizure.
Our prevalence rate is 20%. This is useful to know when it comes to balancing classes and evaluating our model using the ‘lift’ metric.
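As a small sketch, prevalence is just the mean of a 0/1 label vector. The function name calc_prevalence and the ten hypothetical labels below are mine, chosen only to illustrate the arithmetic.

```python
def calc_prevalence(y_actual):
    """Fraction of samples whose label is 1 (the positive class)."""
    return sum(y_actual) / len(y_actual)

# Hypothetical labels: 2 positives out of 10 samples.
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(f"prevalence: {calc_prevalence(labels):.1%}")  # prevalence: 20.0%
```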
Data Processing and Building Training/Validation/Test Sets
There isn’t any feature engineering to be done here since all of our features are numerical EEG readings; no further processing is needed before feeding the dataset into our machine learning models.
It is good practice to separate the predictor and response variables from the dataset.
Now it’s time to split our dataset into training, validation, and testing sets! How exciting! Usually, validation and testing sets are of the same size, and the training sets typically range from 50% to 90% of the primary dataset, depending on the number of samples that the dataset has. The more samples a dataset has, the more samples we can afford to dump into our training set.
The first step is to shuffle our dataset to make sure that there isn’t some order associated with our samples.
The chosen split is 70/15/15, so let’s split our dataset that way. We first separate the validation and test sets from the training set, because we want the validation and test sets to have similar distributions.
We can then check the prevalence in each set to make sure they’re roughly the same, so around 20%.
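The shuffle, the 70/15/15 split, and the prevalence check can be sketched as follows. This assumes a pandas DataFrame with a binary OUTPUT_LABEL column; the synthetic data and all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 1,000 rows, roughly 20% positives (names are illustrative).
rng = np.random.default_rng(42)
df = pd.DataFrame({"X1": rng.normal(size=1000),
                   "OUTPUT_LABEL": (rng.random(1000) < 0.2).astype(int)})

# Shuffle so that any ordering in the file cannot leak into the split.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

# Take 30% for validation + test first, then halve it (15% / 15%).
df_valid_test = df.sample(frac=0.30, random_state=42)
df_test = df_valid_test.sample(frac=0.50, random_state=42)
df_valid = df_valid_test.drop(df_test.index)

# Whatever is left is the training set (70%).
df_train_all = df.drop(df_valid_test.index)

for name, part in [("train", df_train_all), ("valid", df_valid), ("test", df_test)]:
    print(f"{name}: n={len(part)}, prevalence={part['OUTPUT_LABEL'].mean():.3f}")
```

Because validation and test are drawn from the same 30% sample, they should end up with similar prevalence.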
Next, we want to balance our dataset so that the model doesn’t simply learn to classify every sample as the majority class; in our case, that would be patients not having a seizure. This problem is known as the accuracy paradox: when the classes are unbalanced, an accuracy of 80% may only reflect the underlying class distribution. Since the majority of our samples are not seizures, the easiest way for a model to achieve a high accuracy score is to classify every sample as not having a seizure, regardless of its features. There are two straightforward and beginner-friendly ways to combat this problem: sub-sampling and over-sampling. We can sub-sample the dominant class by reducing its number of samples, or we can over-sample the minority class by duplicating its samples until both classes are equal in number. We will use sub-sampling in this project.
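A minimal sketch of sub-sampling, assuming it is applied to the training rows only (my assumption) and that the label column is called OUTPUT_LABEL. The data is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_train_all = pd.DataFrame({"X1": rng.normal(size=1000),
                             "OUTPUT_LABEL": (rng.random(1000) < 0.2).astype(int)})

pos = df_train_all[df_train_all["OUTPUT_LABEL"] == 1]
neg = df_train_all[df_train_all["OUTPUT_LABEL"] == 0]

# Keep an equal number of rows from each class; the larger class is cut down.
n = min(len(pos), len(neg))
df_train = pd.concat([pos.sample(n=n, random_state=0),
                      neg.sample(n=n, random_state=0)])

# Shuffle so the two classes are interleaved.
df_train = df_train.sample(frac=1.0, random_state=0).reset_index(drop=True)
print(len(df_train), df_train["OUTPUT_LABEL"].mean())
```

The validation and test sets are left untouched here so that they still reflect the real class distribution.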
We then save the training, valid, and test sets as .csv files. Before importing sklearn and building our first model, we need to scale our variables for some of our models to work. Since we will be building nine different classification models, we scale our variables once, before fitting any of them.
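One way to do the scaling is sketched below. I am assuming scikit-learn’s StandardScaler here, which the text does not confirm, and the arrays are synthetic. The important part is that the scaler is fitted on the training data only and then reused on the other sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=120.0, size=(200, 5))
X_valid = rng.normal(loc=50.0, scale=120.0, size=(50, 5))

# Learn the per-column mean and standard deviation from the training set only.
scaler = StandardScaler()
scaler.fit(X_train)

# Apply the same transformation everywhere to avoid leaking validation statistics.
X_train_tf = scaler.transform(X_train)
X_valid_tf = scaler.transform(X_valid)
print(X_train_tf.mean(axis=0).round(3), X_train_tf.std(axis=0).round(3))
```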
Let’s set it up so we can print all of our model metrics with one function.
And since we’ve balanced our data, let’s set our threshold at 0.5. The threshold determines whether a sample gets classified as positive or negative. Our models return the predicted probability that a sample belongs to the positive class, so there is no binary classification until we set a threshold. If the probability returned for a sample is higher than our threshold, the sample is classified as positive; otherwise, it is classified as negative.
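A sketch of such a reporting function is below. The name print_report, the choice of metrics, and the four hypothetical samples are my own; only the scikit-learn metric functions are real library calls.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def print_report(y_actual, y_pred_proba, thresh=0.5):
    """Print and return common binary-classification metrics at one threshold."""
    y_actual = np.asarray(y_actual)
    # Probabilities above the threshold become class 1, everything else class 0.
    y_pred = (np.asarray(y_pred_proba) > thresh).astype(int)
    tn = np.sum((y_pred == 0) & (y_actual == 0))
    fp = np.sum((y_pred == 1) & (y_actual == 0))
    metrics = {
        "auc": roc_auc_score(y_actual, y_pred_proba),  # does not depend on thresh
        "accuracy": accuracy_score(y_actual, y_pred),
        "recall": recall_score(y_actual, y_pred),
        "precision": precision_score(y_actual, y_pred),
        "specificity": tn / (tn + fp),
    }
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
    return metrics

# Hypothetical labels and predicted probabilities.
report = print_report([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9], thresh=0.5)
```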
We will cover the following models:
- K Nearest Neighbors
- Logistic Regression
- Stochastic Gradient Descent
- Naive Bayes
- Decision Trees
- Random Forest
- Extremely Randomized Trees (ExtraTrees)
- Gradient Boosting
- Extreme Gradient Boosting (XGBoost)
We will use baseline default arguments for all models, then choose the model with the highest validation score to perform hyperparameter tuning.
K Nearest Neighbors (KNN)
KNN is one of the first models that people learn among scikit-learn’s classification models. The model classifies a sample based on the k samples that are closest to it. For example, if k = 3 and all three of the nearest samples are of the positive class, then the sample is classified as class 1. If two of the three nearest samples are of the positive class, then the sample has a predicted probability of about 67% of being positive.
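The k = 3 example can be reproduced on a toy problem. The six one-dimensional training points below are made up so that the neighbors are easy to read off by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Six 1-D training points: three negatives on the left, three positives on the right.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Neighbors of 4.4 are 4, 5, 3 (all positive); neighbors of 2.6 are 3, 2, 4 (two positive).
proba = knn.predict_proba(np.array([[4.4], [2.6]]))[:, 1]
print(proba)
```

With the default uniform weights, the predicted probability is simply the share of positive neighbors.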
We get a pretty high training Area Under the Receiver Operating Characteristic Curve (ROC AUC), and a high validation AUC as well. This metric is used to measure the performance of classification models. AUC tells us how well the model can distinguish between classes: the higher the AUC, the better the model is at ranking positive samples above negative ones. If the AUC is 0.5, the model is doing no better than random guessing.
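One way to build intuition for AUC: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The helper pairwise_auc below is a hypothetical name, written from scratch for illustration.

```python
def pairwise_auc(y_actual, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_actual) if y == 1]
    neg = [s for s, y in zip(scores, y_actual) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [0, 0, 1, 1]
print(pairwise_auc(y, [0.1, 0.2, 0.8, 0.9]))  # 1.0: every positive outranks every negative
print(pairwise_auc(y, [0.5, 0.5, 0.5, 0.5]))  # 0.5: no better than guessing
```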
Logistic Regression
Logistic regression is a type of generalized linear model, a family that generalizes the concepts and abilities of ordinary linear models.
In logistic regression, the model predicts whether something is true or false, rather than predicting something continuous. The model fits a linear function of the features whose output is the log of the odds, and that output is passed through a sigmoid function to obtain the probability that the sample belongs to the positive class. The resulting decision boundary is linear. Because the model tries to find the best linear separation between the positive and negative classes, it performs well when the classes are clearly separable. It works best when all features are scaled, and it requires the dependent variable to be dichotomous.
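The log-odds-to-probability step looks like this in code. The weights, bias, and sample are hypothetical numbers, not fitted values from this project.

```python
import math

def sigmoid(log_odds):
    """Map log-odds (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical fitted weights and one scaled sample, for illustration only.
weights, bias = [0.8, -1.5], 0.2
sample = [1.0, 0.5]

log_odds = bias + sum(w * x for w, x in zip(weights, sample))  # linear part
probability = sigmoid(log_odds)                                # squashed to (0, 1)
print(round(log_odds, 2), round(probability, 3))
```

Log-odds of 0 map to a probability of exactly 0.5, which is why a 0.5 threshold corresponds to the linear decision boundary.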
Stochastic Gradient Descent
Gradient descent is an optimization algorithm that minimizes the loss function of many different models, such as linear regression, logistic regression, and some clustering models. The SGD model here is similar to logistic regression in that gradient descent is used to optimize a linear function. The difference is that stochastic gradient descent takes each step using a single sample or a small mini-batch of samples instead of the whole dataset. It is especially useful when there are redundancies in the data, usually seen through clustering. Each SGD step is therefore much cheaper than a step of full-batch gradient descent, which usually makes training faster on large datasets.
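To make the mini-batch idea concrete, here is a from-scratch sketch in NumPy of logistic regression trained with mini-batch SGD. It is not scikit-learn’s implementation; the data, learning rate, and batch size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data whose classes are separated by the line x0 + x1 = 0.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
learning_rate, batch_size = 0.1, 32

for epoch in range(20):
    order = rng.permutation(len(X))                  # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        p = 1.0 / (1.0 + np.exp(-(X[idx] @ w + b)))  # predicted probabilities
        error = p - y[idx]                           # gradient of log loss w.r.t. the logits
        # One step uses only this mini-batch, not the whole dataset.
        w -= learning_rate * X[idx].T @ error / len(idx)
        b -= learning_rate * error.mean()

accuracy = np.mean(((X @ w + b) > 0) == (y == 1))
print(f"training accuracy: {accuracy:.3f}")
```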
Naive Bayes
The naive Bayes classifier uses Bayes’ theorem to perform classification. It assumes that the features are independent of each other given the class, so the probability of seeing the features together is just the product of the probabilities of each feature. It then finds the probability of the sample being positive given the observed feature values. The model is often flawed because the “naive” part assumes all features are independent, and that’s not the case most of the time.
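A tiny worked example of the independence assumption follows. The prior and the per-feature likelihoods are hypothetical numbers, and naive_bayes_posterior is an illustrative helper rather than a library function.

```python
from math import prod

def naive_bayes_posterior(prior_pos, likelihoods_pos, likelihoods_neg):
    """P(positive | features), assuming features are independent given the class."""
    score_pos = prior_pos * prod(likelihoods_pos)
    score_neg = (1.0 - prior_pos) * prod(likelihoods_neg)
    return score_pos / (score_pos + score_neg)

# Hypothetical numbers: P(feature_i | class) for two observed features.
posterior = naive_bayes_posterior(
    prior_pos=0.5,
    likelihoods_pos=[0.8, 0.6],  # P(f1 | seizure), P(f2 | seizure)
    likelihoods_neg=[0.2, 0.3],  # P(f1 | no seizure), P(f2 | no seizure)
)
print(round(posterior, 3))
```

Here the positive score is 0.5 × 0.8 × 0.6 = 0.24 and the negative score is 0.5 × 0.2 × 0.3 = 0.03, so the posterior is 0.24 / 0.27.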
Decision Tree
A decision tree is a model that runs a sample through a series of “questions” to determine its class. The algorithm works by repeatedly splitting the data into sub-regions that are increasingly dominated by one class, and the tree stops growing when all samples have been divided into pure groups or when some stopping criterion of the classifier is met.
Decision trees are often used as weak learners; by that, I mean that a single (especially shallow) tree is not particularly accurate and may only do a bit better than random guessing. Fully grown trees also almost always overfit the training data.
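The overfitting tendency is easy to see on noisy synthetic data (not the EEG dataset). A fully grown tree memorizes the training set, including its label noise, and then scores noticeably lower on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 reassigns the labels of 20% of samples at random, i.e. label noise.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0)  # default: grow until leaves are pure
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
valid_acc = tree.score(X_valid, y_valid)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {valid_acc:.3f}")
```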
Random Forest
Since decision trees are likely to overfit, the random forest was created to reduce that. Many decision trees make up a random forest model. Each tree is trained on a bootstrap sample of the dataset and considers a random subset of features at each split, which reduces the correlation between trees and hence the probability of overfitting. We can measure how good a random forest is by using the “out-of-bag” data: each sample is scored only by the trees whose bootstrap sample did not include it. Random forest is also almost always preferred over a single decision tree since the model has a lower variance; hence, the model can generalize better.
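In scikit-learn, the out-of-bag estimate is available by passing oob_score=True. The sketch below uses synthetic data and arbitrary settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# bootstrap=True is the default; oob_score=True scores each sample using only
# the trees whose bootstrap sample did not contain it.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                oob_score=True, random_state=0)
forest.fit(X, y)
print(f"out-of-bag accuracy: {forest.oob_score_:.3f}")
```

The out-of-bag score acts like a built-in validation estimate, since no sample is scored by a tree that saw it during training.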
Extremely Randomized Trees
The ExtraTrees Classifier is similar to Random Forest except:
- Each tree is trained on the entire training set rather than on a bootstrap sample
- Split thresholds are drawn at random for the candidate features, instead of being optimized as in Random Forest
This makes the ExtraTrees Classifier less prone to overfit, and it can often produce a more generalized model than Random Forest.
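The first difference shows up directly in scikit-learn’s default constructor arguments, as this quick check illustrates.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Compare the default sampling behavior of the two ensembles.
rf_bootstrap = RandomForestClassifier().bootstrap
et_bootstrap = ExtraTreesClassifier().bootstrap

print(rf_bootstrap)  # True: each tree sees a bootstrap sample
print(et_bootstrap)  # False: each tree sees the full training set
```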
Gradient Boosting
Gradient boosting is another model that combats the overfitting of decision trees. However, there are some differences between GB and RF. Gradient boosting builds shorter trees, one at a time, and each new tree reduces the error left by the trees before it. That error is called the pseudo-residual. Gradient boosting is often faster than a random forest and is useful in lots of real-world applications. However, gradient boosting doesn’t do that well when your dataset contains noisy data.
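Here is a simplified, hand-rolled sketch of the boosting loop for log loss, where each short tree is fitted to the pseudo-residuals y − p. Real implementations such as scikit-learn’s GradientBoostingClassifier also re-optimize the leaf values, which this sketch skips. The data and settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Start from a constant prediction: the log-odds of the positive class.
raw = np.full(len(y), np.log(y.mean() / (1 - y.mean())))
initial_loss = log_loss(y, sigmoid(raw))

learning_rate = 0.1
for _ in range(50):
    pseudo_residuals = y - sigmoid(raw)        # negative gradient of the log loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, pseudo_residuals)              # a short tree fits the current errors
    raw += learning_rate * tree.predict(X)     # each tree nudges the prediction

final_loss = log_loss(y, sigmoid(raw))
print(f"log loss: {initial_loss:.3f} -> {final_loss:.3f}")
```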
Extreme Gradient Boosting
XGBoost is similar to gradient boosting except:
- Trees have a varying number of terminal nodes
- Leaf weights of the trees that are calculated with less evidence are shrunk more heavily
- Newton boosting provides a more direct route to the minima than gradient descent
- Extra randomization parameter is used to reduce the correlation between trees
- Uses a more regularized model to control over-fitting since standard GBM has no regularization, which gives it better performance over GBM.
- XGB implements parallel processing and is much faster than GBM.
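Two of the points above, Newton boosting and heavier shrinkage for leaves with less evidence, can be seen in the regularized leaf-weight formula w = −G / (H + λ), where G and H are the sums of the gradients and Hessians in the leaf. The sketch below is plain Python written for illustration, not a call into the xgboost library.

```python
def newton_leaf_weight(gradients, hessians, reg_lambda=1.0):
    """Regularized Newton step for one leaf: w = -sum(g) / (sum(h) + lambda)."""
    return -sum(gradients) / (sum(hessians) + reg_lambda)

# For log loss, g = p - y and h = p * (1 - p). At p = 0.5 with y = 1:
g, h = 0.5 - 1.0, 0.5 * 0.5

small_leaf = newton_leaf_weight([g] * 2, [h] * 2)        # 2 samples of evidence
large_leaf = newton_leaf_weight([g] * 200, [h] * 200)    # 200 samples of evidence
unregularized = newton_leaf_weight([g] * 2, [h] * 2, reg_lambda=0.0)

print(small_leaf, large_leaf, unregularized)
```

Every sample in both leaves asks for the same correction, but the leaf with only two samples is pulled much closer to zero by λ.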
Model Selection and Validation
The next step is to visualize the performance of all of our models in one graph; it makes it easier to pick which one we want to tune. The metric I chose to evaluate my models is the ROC AUC. You can choose any metric you want to optimize for, such as accuracy or lift; however, the AUC isn’t affected by the threshold you choose, so it’s a metric that many people use to evaluate their models.
Seven of the nine models have very high performance, most likely due to the extreme differences in EEG readings between a patient having a seizure and not having one. The decision tree looks like it overfitted, as expected: notice the gap between the training AUC and the validation AUC.
I’m going to pick XGBoost and ExtraTrees classifier as the two models to tune.
Learning curves are a way for us to visualize the bias-variance tradeoff in our models. We make use of the learning curve code from scikit-learn but plot the AUC instead, since that’s the metric we chose to evaluate our models with.
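A sketch of computing the points of such a curve with scikit-learn’s learning_curve, scored by AUC, is shown below. The synthetic data and settings are arbitrary, and the plotting itself is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

train_sizes, train_scores, cv_scores = learning_curve(
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    X, y,
    cv=5,
    scoring="roc_auc",                     # AUC instead of the default accuracy
    train_sizes=np.linspace(0.2, 1.0, 5),
)

# Average over the CV folds; these two series are what would be plotted.
for n, tr, cv in zip(train_sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"n={int(n):4d}  train AUC={tr:.3f}  cv AUC={cv:.3f}")
```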
Both the training and CV curves are high, so this signals both low variance and low bias in our ExtraTrees classifier.
However, if both curves have low, similar scores, that’s a sign of high bias. If there is a big gap between your curves, that’s a sign of high variance.
Here are some tips on what to do in both scenarios:
High bias:
– Increase model complexity
– Reduce regularization
– Change model architecture
– Add new features
High variance:
– Add more samples
– Reduce the number of features
– Add/increase regularization
– Decrease model complexity
– Combine features
– Change model architecture
Just like in regression models, where you can tell the magnitude of a feature’s impact from its coefficient, you can measure how much each feature contributes in classification models.
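For tree ensembles in scikit-learn, this information is exposed through the feature_importances_ attribute (linear models expose coef_ instead). The sketch below uses synthetic data, so the feature indices carry no meaning beyond the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# shuffle=False keeps the 3 informative features in columns 0-2 (synthetic data).
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = model.feature_importances_   # impurity-based, normalized to sum to 1
ranking = np.argsort(importances)[::-1]    # most important feature first
for idx in ranking[:5]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```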
Depending on your bias-variance diagnosis, you may choose to drop features or to come up with new variables by combining some, according to this graph. However, for my model, there is no need to do that. Technically speaking, EEG readings are the only feature that I have, and the more readings, the better the classification model will become.
The next step one should perform is to tune the knobs in our model, also known as hyperparameter tuning. There are several ways to do this.
Grid search is a traditional technique for hyperparameter tuning, meaning that it was the first to be developed beyond manually tuning each hyperparameter. It requires you to list all values of the relevant hyperparameters (e.g., all the learning rates you want to test) and measures the performance of the model using cross-validation for every possible combination of those values. The drawback of this method is that it takes a long time to evaluate when we have lots of hyperparameters to tune.
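A small sketch with scikit-learn’s GridSearchCV is below. The grid values are hypothetical and deliberately tiny, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hypothetical grid: 2 x 3 = 6 combinations, each evaluated with 3-fold CV.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, 6, None]}

search = GridSearchCV(ExtraTreesClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=3)
search.fit(X, y)

print(len(search.cv_results_["params"]), "combinations tried")
print(search.best_params_, round(search.best_score_, 3))
```

Adding one more hyperparameter with four values would quadruple the number of fits, which is the cost the paragraph above describes.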
Random search uses random combinations of the hyperparameters to find the best performing model. You still need to input all values of the hyperparameters you want to tune; however, the algorithm samples combinations from the grid at random instead of trying every combination. This often beats grid search in terms of time, because random sampling can reach a well-performing combination much sooner than an exhaustive search, according to this paper.
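With scikit-learn’s RandomizedSearchCV, the n_iter argument caps how many combinations are tried. The grid and data below are again hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# A grid of 4 x 4 x 3 = 48 possible combinations ...
param_distributions = {
    "n_estimators": [25, 50, 75, 100],
    "max_depth": [3, 6, 9, None],
    "min_samples_split": [2, 5, 10],
}

# ... of which only n_iter randomly chosen ones are evaluated.
search = RandomizedSearchCV(ExtraTreesClassifier(random_state=0), param_distributions,
                            n_iter=8, scoring="roc_auc", cv=3, random_state=0)
search.fit(X, y)
print(len(search.cv_results_["params"]), "of 48 combinations tried")
print(search.best_params_, round(search.best_score_, 3))
```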
Genetic programming or genetic algorithm (GA) is based on Charles Darwin’s theory of survival of the fittest. GA applies small, slow, and random changes to the current hyperparameters. It works by assigning a fitness value to each solution: the higher the fitness value, the higher the quality of the solution. It then selects the individuals with the highest fitness values and puts them into a “mating pool”, where two individuals generate two offspring (with some changes applied to the offspring) that are expected to have higher quality than their parents. This happens over and over until we reach the desired optimal value.
TPOT is an open source library under active development, first developed by researchers at the University of Pennsylvania. It takes multiple copies of the entire training dataset, performs its own variation of one-hot encoding (if needed), and then optimizes the hyperparameters using a genetic algorithm.
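The selection, mating, and mutation loop can be sketched in a few lines of plain Python. The fitness function here is a toy stand-in for a validation score, and the two “hyperparameters” (depth and rate) are invented for the example.

```python
import random

random.seed(0)

def fitness(params):
    """Toy stand-in for a validation score; peaks at depth=8, rate=0.1."""
    depth, rate = params
    return -((depth - 8) ** 2) - 100 * (rate - 0.1) ** 2

def mutate(params):
    """Apply a small random change to each hyperparameter."""
    depth, rate = params
    return (max(1, depth + random.choice([-1, 0, 1])),
            min(1.0, max(0.01, rate + random.uniform(-0.02, 0.02))))

def crossover(a, b):
    """Two parents produce two offspring by swapping one hyperparameter."""
    return (a[0], b[1]), (b[0], a[1])

population = [(random.randint(1, 20), random.uniform(0.01, 1.0)) for _ in range(10)]
initial_best = max(fitness(p) for p in population)

for generation in range(30):
    # Selection: the fittest half forms the mating pool.
    pool = sorted(population, key=fitness, reverse=True)[:5]
    children = []
    while len(children) < 5:
        child_a, child_b = crossover(*random.sample(pool, 2))
        children += [mutate(child_a), mutate(child_b)]
    population = pool + children[:5]  # keep the parents (elitism) plus offspring

best = max(population, key=fitness)
print(best, round(fitness(best), 4))
```

Keeping the parents in the next generation means the best fitness can never get worse from one generation to the next.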
We will use dask with tpot’s automl to perform this. We pass the xgboost and extratrees classifiers into the tpot config to tell it we only want the algorithm to perform searches within these two classification models. We also tell tpot to export every model made to a destination in case we want to stop it early.
The best performing model, with an AUC of 0.997, is the optimized ExtraTrees classifier. Below is its performance on all three datasets.
We also plot the ROC curves to visualize the AUCs above.
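The points of a ROC curve can be computed with scikit-learn’s roc_curve. The labels and probabilities below are hypothetical, and the plotting call itself is omitted.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities, for illustration only.
y_actual = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_proba = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

# One (false positive rate, true positive rate) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_actual, y_proba)
auc = roc_auc_score(y_actual, y_proba)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC={auc:.3f}")
```

Plotting tpr against fpr for the training, validation, and test sets on one set of axes gives the kind of graph described above.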
Now, communicating the essential points of this project to a VP or CEO is often the hardest part of the job, so here is what I would say, concisely, to a high-level stakeholder.
In this project, we created a classification machine learning model that can predict whether patients are having a seizure from their EEG readings. The best performing model has a lift of 4.3, meaning it is 4.3 times better than random guessing. It is also 97.4% correct in predicting the positive class in the test set. If this model were put into production to predict whether a patient is having a seizure, you could expect similar performance in correctly identifying those who are having a seizure.
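For readers who want the arithmetic behind lift, one common definition is precision divided by prevalence. I am assuming that definition here, since the text does not spell out which one was used. The counts below are hypothetical, chosen so the numbers are easy to follow, and are not the project’s actual confusion matrix.

```python
def lift(precision, prevalence):
    """How many times more often a flagged sample is positive than a random one."""
    return precision / prevalence

# Hypothetical counts: 100 samples flagged as seizures, 86 of them truly seizures,
# in a population where 20% of samples are seizures.
precision = 86 / 100
print(round(lift(precision, prevalence=0.20), 2))  # 4.3

# With a prevalence of 20%, even a perfectly precise model tops out at a lift of 5.
print(lift(1.0, prevalence=0.20))  # 5.0
```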
Thank you for reading!