Creating a Classification Model: Introduction | Peter Mičko | Sep 2024


B2. Data Preparation

Data preparation is a crucial step with a significant impact on the model’s performance: a model can only be as good as the data it is trained on. Many models require transformed input, such as scaling continuous variables to a common range and encoding categorical variables as numbers. The author describes four steps that make up this phase:

  • Data Fusion: all sources are combined into one table based on common fields.
  • Feature Engineering: improving the data by deriving new variables (which can also serve as a dimensionality reduction technique).
  • Data Cleaning: removing duplicate rows and replacing incorrect or missing values.
  • Data Splitting: normalizing or standardizing the data and splitting it into training and test sets.

In this preparation phase, I used tools from the pandas and scikit-learn packages. I will first describe the techniques in theory; their practical application in the project can be seen in the Jupyter notebook on my GitHub account.

B2.1. Data Fusion

Since I will use a pre-processed dataset for model training, I will skip the first preparation step, data fusion. In pandas, merging is done with the pd.merge() method; more details can be found in the documentation.
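Because this step is skipped in the project, here is only a minimal, hypothetical sketch of pd.merge(); the table and column names are invented for illustration:

    import pandas as pd

    # Hypothetical tables sharing an employee_id key (illustration only).
    employees = pd.DataFrame({"employee_id": [1, 2, 3],
                              "department": ["IT", "hr", "marketing"]})
    evaluations = pd.DataFrame({"employee_id": [2, 3, 4],
                                "last_evaluation": [0.55, 0.89, 0.72]})

    # Inner join: keep only rows whose key appears in both tables (ids 2 and 3).
    merged = pd.merge(employees, evaluations, on="employee_id", how="inner")
    print(merged)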

The data includes information such as employee satisfaction, performance rating, salary, department, time spent at the company, etc. It is artificially generated and modified to contain “NaN” values so that data cleaning methods can be tested. For the same reason, an is_smoker column has been added.

B2.2. Data Cleaning

The next step in preparing the data for training is dealing with missing values, since models cannot be trained on incomplete samples. This can be handled in several ways (a combined sketch follows the list):

  • Removal of rows with missing values: not recommended, because information is lost and the trained model’s performance may suffer. Rows containing missing values (labeled “NaN” in the dataset) are removed with the df.dropna() method.
  • Removal of an entire column: for example, the column with the most missing values of all variables. The call df.dropna(axis=1) removes every column containing “NaN”. Simply discarding “NaN” values is rarely the best choice because of the information loss, but in my analysis I removed the entire is_smoker column since 98.4% of its values were “NaN”.
  • Imputation of missing values with the mean or median of a given variable, using df[column_name].fillna(value=df[column_name].mean()). In my analysis, I replaced the missing values in the time_spend_company column with the median.
  • To replace “NaN” values in the average_monthly_hours column, I used values derived from the number_project column.
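A minimal sketch combining these steps, assuming the dataset is loaded into a DataFrame df (the file name is hypothetical, and the exact imputation rule for average_monthly_hours is my reading of the step above):

    import pandas as pd

    df = pd.read_csv("hr_data.csv")  # hypothetical file name

    # Drop the is_smoker column, which is almost entirely missing (98.4% NaN).
    df = df.drop(columns=["is_smoker"])

    # Impute time_spend_company with the column median.
    df["time_spend_company"] = df["time_spend_company"].fillna(
        df["time_spend_company"].median())

    # Impute average_monthly_hours from number_project: one plausible rule is
    # to fill each NaN with the mean hours of employees who work on the same
    # number of projects.
    df["average_monthly_hours"] = df["average_monthly_hours"].fillna(
        df.groupby("number_project")["average_monthly_hours"].transform("mean"))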

B2.3. Feature Engineering

The next phase of data preparation involves converting non-numeric values (text, boolean values) to numeric types. This conversion is necessary before most models can be trained. There are several ways to achieve it (a combined sketch follows the list):

  • The .map() function: I manually mapped the target variable using a dictionary.
  • The LabelEncoder from the scikit-learn package: a label encoder instance is created, fitted, and used to transform the data in one step via the fit_transform method. The catch is ordinal variables, i.e. variables whose values follow a specific, escalating order. In my dataset, for example, the salary column contains the values “low”, “medium”, and “high”. LabelEncoder sorts labels alphabetically, so it would assign “high”=0, “low”=1, “medium”=2, which does not match the semantic order (ideally “low”=0, “medium”=1, “high”=2). In such cases it is better to use the .map() function and define the mapping manually.
  • The one-hot encoding method: used for nominal variables, i.e. variables whose values have no inherent order; a value merely assigns the observation to a group or category. In my case this is, for example, the department column with values like “IT”, “hr”, “management”, “marketing”, etc. I used the pd.get_dummies(df) method to encode each unique value of the variable into a separate column holding only two values, “True” (1) and “False” (0): a given observation either has that value or does not. After encoding, it is worth listing all the columns to see how many new ones were added. Note that some ML models can be trained on datasets that still contain text variables; the necessary conversion happens “behind the scenes.”
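A short sketch of the three approaches; applying .map() to salary here follows the recommendation above, and the commented-out LabelEncoder lines show the alphabetical-ordering pitfall:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # .map(): a manual dictionary mapping preserves the intended ordering.
    df["salary"] = df["salary"].map({"low": 0, "medium": 1, "high": 2})

    # LabelEncoder alternative: labels are sorted alphabetically, so salary
    # would get "high"=0, "low"=1, "medium"=2 -- fine for nominal variables,
    # wrong for ordinal ones.
    # le = LabelEncoder()
    # df["salary"] = le.fit_transform(df["salary"])

    # One-hot encoding: each unique department value becomes its own
    # True/False column.
    df = pd.get_dummies(df, columns=["department"])
    print(df.columns)  # inspect how many new columns were added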

B2.4. Data Splitting into Training and Testing Sets

The final step in data preparation is splitting the data into training and testing sets, done with the train_test_split function from the scikit-learn library. Its arguments are X (the features) and y (the target variable). The function divides the dataset into a training set, consisting of X_train (training features) and y_train (training target values), and a test set, consisting of X_test (test features) and y_test (test target values). The ratio of the split is configurable (commonly 70/30). I will perform the split during model training.
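A minimal sketch, assuming the prepared DataFrame df and a target column named left (the column name is an assumption):

    from sklearn.model_selection import train_test_split

    X = df.drop(columns=["left"])  # features; "left" is an assumed target name
    y = df["left"]                 # target: 0 = stayed, 1 = left

    # 70/30 split; random_state fixes the shuffle so results are reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)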

B3. Description of Models

Classification models here predict one of two classes: the employee stayed or left (0/1). This is binary classification, though multi-class classification is also possible.

I trained three classification models (a training sketch follows the list):

  • Support Vector Machines (SVM): the algorithm searches for the hyperplane that best separates observations into classes, maximizing the distance (margin) between the hyperplane and the nearest samples of each class. This hyperplane acts as a boundary that determines which class an observation belongs to. Both linear and nonlinear relationships can be modeled in a multidimensional space (the “kernel” parameter).
  • K-Nearest Neighbours (KNN): an algorithm that determines an observation’s class from its “k nearest neighbors” in the space formed by the input variables, where each dimension corresponds to one variable. With 3 variables (a 3-dimensional space), you can picture a sphere around the sample whose class we want to predict; the size and shape of that neighborhood depend on the model’s parameters.
  • Random Forest: an ensemble of decision trees, each trained on a different subset of the training data. Class membership is determined through a series of decisions (based on certain conditions), modeled as a tree branching from top to bottom, where each decision creates new branches with further decisions. Each split should yield the greatest information gain. The final classification is decided by majority vote across the trees.
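A sketch of how the three models can be trained on the split from B2.4; the hyperparameters shown are scikit-learn defaults or common choices, not necessarily the ones used in the project:

    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier

    models = {
        "SVM": SVC(kernel="rbf"),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Random Forest": RandomForestClassifier(n_estimators=100,
                                                random_state=42),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)             # train on the training set
        accuracy = model.score(X_test, y_test)  # mean accuracy on the test set
        print(f"{name}: test accuracy = {accuracy:.3f}")

Note that model.score reports plain accuracy; for imbalanced classes, metrics such as precision and recall can be more informative.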