Welcome to the World of Fine-Tuning Pre-Trained Models with Hugging Face Transformers
When it comes to training machine learning models for text classification tasks, starting from scratch can be time-consuming and resource-intensive. Luckily, with the power of pre-trained models and libraries like Hugging Face Transformers, we can fine-tune existing models to suit our specific needs efficiently and effectively.
Let’s walk through the process step-by-step:
To begin, we import the necessary libraries:
from datasets import DatasetDict, Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
import evaluate
import numpy as np
Next, we load our training dataset, which contains 3,000 text-label pairs with a 70–15–15 train-test-validation split. You can find the original data here under an open database license.
dataset_dict = load_dataset("shawhin/phishing-site-classification")
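If you want to sanity-check the load, you can print the DatasetDict and peek at one training example. A minimal sketch, assuming the splits are named train, test, and validation and that each record carries a "text" field alongside its label:
# inspect the loaded splits (split names assumed)
print(dataset_dict)
# peek at one training example (field names assumed)
print(dataset_dict["train"][0])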
Now, using the Transformers library, we can easily load and adapt pre-trained models. Let’s take a look at how this is done for the BERT model:
# define pre-trained model path
model_path = "google-bert/bert-base-uncased"

# load model tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# load model with binary classification head
id2label = {0: "Safe", 1: "Not Safe"}
label2id = {"Safe": 0, "Not Safe": 1}
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
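As a quick check (an addition of mine, not part of the original walkthrough), you can confirm that the label mapping and the two-output classification head were set up as expected; the .classifier attribute name assumes a BERT-style model:
# confirm the label mapping and the 2-output classification head (BERT-style attribute name assumed)
print(model.config.id2label)  # {0: "Safe", 1: "Not Safe"}
print(model.classifier)       # Linear layer with out_features=2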
Our model is now loaded, but we still want to make training cheaper. One technique we’ll use is freezing most of the model’s parameters and training only the final pooling layer and the classification head. This significantly reduces computational cost.
# freeze all base model parameters
for name, param in model.base_model.named_parameters():
    param.requires_grad = False

# unfreeze base model pooling layers
for name, param in model.base_model.named_parameters():
    if "pooler" in name:
        param.requires_grad = True
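To verify the freeze worked, a small sanity check (again, my addition rather than part of the original code) counts trainable versus total parameters; only the pooler and the classification head should remain trainable:
# count trainable vs. total parameters to confirm the freeze
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")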
Next, let’s preprocess our data. Tokenizing URLs and truncating them are the key operations here.
# define text preprocessing
def preprocess_function(examples):
    # return tokenized text with truncation
    return tokenizer(examples["text"], truncation=True)

# preprocess all datasets
tokenized_data = dataset_dict.map(preprocess_function, batched=True)
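The map call adds the tokenizer outputs alongside the original columns; you can confirm this by inspecting one record (the train split name is assumed):
# the tokenized datasets now carry input_ids, attention_mask, etc. next to the original fields
print(tokenized_data["train"][0].keys())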
Because we truncate but do not pad during preprocessing, we also need a data collator that dynamically pads the token sequences in each batch to a uniform length during training. This takes a single line of code.
# create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Before training, defining a function to compute metrics like accuracy and AUC helps monitor progress. Here’s how it’s done:
# load metrics
accuracy = evaluate.load("accuracy")
auc_score = evaluate.load("roc_auc")

def compute_metrics(eval_pred):
    # get predictions
    predictions, labels = eval_pred

    # apply softmax to get probabilities
    ...
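The rest of the function is elided above. A minimal sketch of one way to complete it, assuming the positive ("Not Safe") class sits at index 1 and reusing the accuracy and roc_auc metrics loaded above:
# sketch: one possible completion of compute_metrics (positive class at index 1 assumed)
def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # softmax over logits to get class probabilities
    probabilities = np.exp(predictions) / np.exp(predictions).sum(-1, keepdims=True)
    positive_class_probs = probabilities[:, 1]

    # AUC is computed from positive-class probabilities
    auc = auc_score.compute(prediction_scores=positive_class_probs,
                            references=labels)["roc_auc"]

    # accuracy is computed from hard class predictions
    predicted_classes = np.argmax(predictions, axis=1)
    acc = accuracy.compute(predictions=predicted_classes,
                           references=labels)["accuracy"]

    return {"Accuracy": round(acc, 3), "AUC": round(auc, 3)}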
Next, we define the hyperparameters and set the training arguments, then move on to training the model.
# hyperparameters
...

training_args = TrainingArguments(
    output_dir="bert-phishing-classifier_teacher",
    learning_rate=lr,
    ...
)
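The specific values are elided above. A sketch of what a fuller configuration might look like, where the learning rate, batch size, and epoch count are illustrative assumptions rather than the author’s values (note that eval_strategy was called evaluation_strategy in older transformers releases):
# sketch: illustrative hyperparameters and a fuller TrainingArguments setup
lr = 2e-4          # assumed value
batch_size = 8     # assumed value
num_epochs = 10    # assumed value

training_args = TrainingArguments(
    output_dir="bert-phishing-classifier_teacher",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    logging_strategy="epoch",
    eval_strategy="epoch",   # "evaluation_strategy" in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
)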
Now, we pass the training arguments into the Trainer class and kick off the training process.
trainer = Trainer(
    model=model,
    args=training_args,
    ...
)

trainer.train()
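The elided arguments would typically be the tokenized datasets, the tokenizer, the data collator, and the metric function defined earlier. A sketch, with the train/test split names assumed:
# sketch: a fuller Trainer setup wiring in the objects defined earlier (split names assumed)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()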
Once trained, we evaluate the model’s performance on independent validation data.
# apply model to validation dataset
predictions = trainer.predict(tokenized_data["validation"])

# Extract the logits and labels from the predictions object
...
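One way to finish this step is to reuse the compute_metrics helper from above; the .predictions and .label_ids attributes come from the PredictionOutput object that trainer.predict returns:
# sketch: pull logits and labels out of the PredictionOutput and score them
logits = predictions.predictions
labels = predictions.label_ids

metrics = compute_metrics((logits, labels))
print(metrics)  # e.g. {"Accuracy": ..., "AUC": ...}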
Congratulations! Your model has been fine-tuned and evaluated with promising results. Now, you’re ready to deploy it for real-world applications.