Creating a Classification Model: Introduction | Peter Mičko | Sep 2024


B2. Data Preparation

Data preparation is a crucial step with a significant impact on the model’s performance: a model can only be as good as the data it is trained on. Many models require transformed input, such as scaling continuous variables to a common range and encoding categorical variables as numbers. The author describes four steps that make up this phase:

  • Data Fusion: all sources are combined into one table based on common fields.
  • Feature Engineering: improving the data by deriving new variables (which can also serve as a dimensionality reduction technique).
  • Data Cleaning: removing duplicate rows and replacing incorrect or missing values.
  • Data Splitting: normalizing or standardizing the data and splitting it into training and test sets.

In this preparation phase, I used tools from the pandas and scikit-learn packages. I will first describe the techniques in theory; their practical application in the project can be seen in the Jupyter notebook on my GitHub account.

B2.1. Data Fusion

Since I will use a pre-processed dataset for model training, I will skip the first preparation step, data fusion. In pandas, merging is done with the pd.merge() method; more details can be found in the documentation.
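Because this step is skipped in the project, here is only a minimal, hypothetical sketch of pd.merge(); the table and column names are invented for illustration:

    import pandas as pd

    # Hypothetical tables sharing an employee_id key (illustration only).
    employees = pd.DataFrame({"employee_id": [1, 2, 3],
                              "department": ["IT", "hr", "marketing"]})
    evaluations = pd.DataFrame({"employee_id": [2, 3, 4],
                                "last_evaluation": [0.55, 0.89, 0.72]})

    # Inner join: keep only rows whose key appears in both tables (ids 2 and 3).
    merged = pd.merge(employees, evaluations, on="employee_id", how="inner")
    print(merged)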

The data includes information such as employee satisfaction, performance rating, salary, department, time spent at the company, etc. It is artificially generated and modified to contain “NaN” values so that data cleaning methods can be tested. For the same reason, an is_smoker column has been added.

B2.2. Data Cleaning

The next step in preparing the data for training is dealing with missing values, since models cannot be trained on incomplete samples. This can be handled in several ways (a combined sketch follows the list):

  • Removal of rows with missing values: not recommended, because information is lost and the trained model’s performance may suffer. Rows containing missing values (labeled “NaN” in the dataset) are removed with the df.dropna() method.
  • Removal of an entire column: for example, the column with the most missing values of all variables. The call df.dropna(axis=1) removes every column containing “NaN”. Simply discarding “NaN” values is rarely the best choice because of the information loss, but in my analysis I removed the entire is_smoker column since 98.4% of its values were “NaN”.
  • Imputation of missing values with the mean or median of a given variable, using df[column_name].fillna(value=df[column_name].mean()). In my analysis, I replaced the missing values in the time_spend_company column with the median.
  • To replace “NaN” values in the average_monthly_hours column, I used values derived from the number_project column.
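A minimal sketch combining these steps, assuming the dataset is loaded into a DataFrame df (the file name is hypothetical, and the exact imputation rule for average_monthly_hours is my reading of the step above):

    import pandas as pd

    df = pd.read_csv("hr_data.csv")  # hypothetical file name

    # Drop the is_smoker column, which is almost entirely missing (98.4% NaN).
    df = df.drop(columns=["is_smoker"])

    # Impute time_spend_company with the column median.
    df["time_spend_company"] = df["time_spend_company"].fillna(
        df["time_spend_company"].median())

    # Impute average_monthly_hours from number_project: one plausible rule is
    # to fill each NaN with the mean hours of employees who work on the same
    # number of projects.
    df["average_monthly_hours"] = df["average_monthly_hours"].fillna(
        df.groupby("number_project")["average_monthly_hours"].transform("mean"))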

B2.3. Feature Engineering

The next phase of data preparation involves converting non-numeric values (text, boolean values) to numeric types. This conversion is necessary before most models can be trained. There are several ways to achieve it (a combined sketch follows the list):

  • The .map() function: I manually mapped the target variable using a dictionary.
  • The LabelEncoder from the scikit-learn package: a label encoder instance is created, fitted, and used to transform the data in one step via the fit_transform method. The catch is ordinal variables, i.e. variables whose values follow a specific, escalating order. In my dataset, for example, the salary column contains the values “low”, “medium”, and “high”. LabelEncoder sorts labels alphabetically, so it would assign “high”=0, “low”=1, “medium”=2, which does not match the semantic order (ideally “low”=0, “medium”=1, “high”=2). In such cases it is better to use the .map() function and define the mapping manually.
  • The one-hot encoding method: used for nominal variables, i.e. variables whose values have no inherent order; a value merely assigns the observation to a group or category. In my case this is, for example, the department column with values like “IT”, “hr”, “management”, “marketing”, etc. I used the pd.get_dummies(df) method to encode each unique value of the variable into a separate column holding only two values, “True” (1) and “False” (0): a given observation either has that value or does not. After encoding, it is worth listing all the columns to see how many new ones were added. Note that some ML models can be trained on datasets that still contain text variables; the necessary conversion happens “behind the scenes.”
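A short sketch of the three approaches; applying .map() to salary here follows the recommendation above, and the commented-out LabelEncoder lines show the alphabetical-ordering pitfall:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # .map(): a manual dictionary mapping preserves the intended ordering.
    df["salary"] = df["salary"].map({"low": 0, "medium": 1, "high": 2})

    # LabelEncoder alternative: labels are sorted alphabetically, so salary
    # would get "high"=0, "low"=1, "medium"=2 -- fine for nominal variables,
    # wrong for ordinal ones.
    # le = LabelEncoder()
    # df["salary"] = le.fit_transform(df["salary"])

    # One-hot encoding: each unique department value becomes its own
    # True/False column.
    df = pd.get_dummies(df, columns=["department"])
    print(df.columns)  # inspect how many new columns were added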

B2.4. Data Splitting into Training and Testing Sets

The final step in data preparation is splitting the data into training and testing sets, done with the train_test_split function from the scikit-learn library. Its arguments are X (the features) and y (the target variable). The function divides the dataset into a training set, consisting of X_train (training features) and y_train (training target values), and a test set, consisting of X_test (test features) and y_test (test target values). The ratio of the split is configurable (commonly 70/30). I will perform the split during model training.
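A minimal sketch, assuming the prepared DataFrame df and a target column named left (the column name is an assumption):

    from sklearn.model_selection import train_test_split

    X = df.drop(columns=["left"])  # features; "left" is an assumed target name
    y = df["left"]                 # target: 0 = stayed, 1 = left

    # 70/30 split; random_state fixes the shuffle so results are reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)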

B3. Description of Models

Classification models here predict one of two classes: the employee stayed or left (0/1). This is binary classification, though multi-class classification is also possible.

I trained three classification models (a training sketch follows the list):

  • Support Vector Machines (SVM): the algorithm searches for the hyperplane that best separates observations into classes, maximizing the distance (margin) between the hyperplane and the nearest samples of each class. This hyperplane acts as a boundary that determines which class an observation belongs to. Both linear and nonlinear relationships can be modeled in a multidimensional space (the “kernel” parameter).
  • K-Nearest Neighbours (KNN): an algorithm that determines an observation’s class from its “k nearest neighbors” in the space formed by the input variables, where each dimension corresponds to one variable. With 3 variables (a 3-dimensional space), you can picture a sphere around the sample whose class we want to predict; the size and shape of that neighborhood depend on the model’s parameters.
  • Random Forest: an ensemble of decision trees, each trained on a different subset of the training data. Class membership is determined through a series of decisions (based on certain conditions), modeled as a tree branching from top to bottom, where each decision creates new branches with further decisions. Each split should yield the greatest information gain. The final classification is decided by majority vote across the trees.
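A sketch of how the three models can be trained on the split from B2.4; the hyperparameters shown are scikit-learn defaults or common choices, not necessarily the ones used in the project:

    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier

    models = {
        "SVM": SVC(kernel="rbf"),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Random Forest": RandomForestClassifier(n_estimators=100,
                                                random_state=42),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)             # train on the training set
        accuracy = model.score(X_test, y_test)  # mean accuracy on the test set
        print(f"{name}: test accuracy = {accuracy:.3f}")

Note that model.score reports plain accuracy; for imbalanced classes, metrics such as precision and recall can be more informative.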