Welcome to the World of Fine-Tuning Pre-Trained Models with Hugging Face Transformers
When it comes to training machine learning models for text classification tasks, starting from scratch can be time-consuming and resource-intensive. Luckily, with the power of pre-trained models and libraries like Hugging Face Transformers, we can fine-tune existing models to suit our specific needs efficiently and effectively.
Let’s walk through the process step-by-step:
To begin, we import the necessary libraries:
from datasets import DatasetDict, Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
import evaluate
import numpy as np
Next, we load our training dataset, which contains 3,000 text-label pairs with a 70–15–15 train-test-validation split. You can find the original data here under an open database license.
dataset_dict = load_dataset("shawhin/phishing-site-classification")
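If you want to sanity-check the load, you can print the DatasetDict and peek at one training example. A minimal sketch, assuming the splits are named train, test, and validation and that each record carries a "text" field alongside its label:
# inspect the loaded splits (split names assumed)
print(dataset_dict)
# peek at one training example (field names assumed)
print(dataset_dict["train"][0])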
Now, using the Transformers library, we can easily load and adapt pre-trained models. Let’s take a look at how this is done for the BERT model:
# define pre-trained model path
model_path = "google-bert/bert-base-uncased"

# load model tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# load model with binary classification head
id2label = {0: "Safe", 1: "Not Safe"}
label2id = {"Safe": 0, "Not Safe": 1}
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
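As a quick check (an addition of mine, not part of the original walkthrough), you can confirm that the label mapping and the two-output classification head were set up as expected; the .classifier attribute name assumes a BERT-style model:
# confirm the label mapping and the 2-output classification head (BERT-style attribute name assumed)
print(model.config.id2label)  # {0: "Safe", 1: "Not Safe"}
print(model.classifier)       # Linear layer with out_features=2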
Our model is now loaded, but we still want to make training cheaper. One technique we’ll use is freezing most of the model’s parameters and training only the final pooling layer and the classification head. This significantly reduces computational cost.
# freeze all base model parameters
for name, param in model.base_model.named_parameters():
    param.requires_grad = False

# unfreeze base model pooling layers
for name, param in model.base_model.named_parameters():
    if "pooler" in name:
        param.requires_grad = True
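To verify the freeze worked, a small sanity check (again, my addition rather than part of the original code) counts trainable versus total parameters; only the pooler and the classification head should remain trainable:
# count trainable vs. total parameters to confirm the freeze
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")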
Next, let’s preprocess our data. Tokenizing URLs and truncating them are the key operations here.
# define text preprocessing
def preprocess_function(examples):
    # return tokenized text with truncation
    return tokenizer(examples["text"], truncation=True)

# preprocess all datasets
tokenized_data = dataset_dict.map(preprocess_function, batched=True)
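The map call adds the tokenizer outputs alongside the original columns; you can confirm this by inspecting one record (the train split name is assumed):
# the tokenized datasets now carry input_ids, attention_mask, etc. next to the original fields
print(tokenized_data["train"][0].keys())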
Because we truncate but do not pad during preprocessing, we also need a data collator that dynamically pads the token sequences in each batch to a uniform length during training. This takes a single line of code.
# create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Before training, defining a function to compute metrics like accuracy and AUC helps monitor progress. Here’s how it’s done:
# load metrics
accuracy = evaluate.load("accuracy")
auc_score = evaluate.load("roc_auc")

def compute_metrics(eval_pred):
    # get predictions
    predictions, labels = eval_pred

    # apply softmax to get probabilities
    ...
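The rest of the function is elided above. A minimal sketch of one way to complete it, assuming the positive ("Not Safe") class sits at index 1 and reusing the accuracy and roc_auc metrics loaded above:
# sketch: one possible completion of compute_metrics (positive class at index 1 assumed)
def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # softmax over logits to get class probabilities
    probabilities = np.exp(predictions) / np.exp(predictions).sum(-1, keepdims=True)
    positive_class_probs = probabilities[:, 1]

    # AUC is computed from positive-class probabilities
    auc = auc_score.compute(prediction_scores=positive_class_probs,
                            references=labels)["roc_auc"]

    # accuracy is computed from hard class predictions
    predicted_classes = np.argmax(predictions, axis=1)
    acc = accuracy.compute(predictions=predicted_classes,
                           references=labels)["accuracy"]

    return {"Accuracy": round(acc, 3), "AUC": round(auc, 3)}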
Next, we define the hyperparameters and set the training arguments, then move on to training the model.
# hyperparameters
...

training_args = TrainingArguments(
    output_dir="bert-phishing-classifier_teacher",
    learning_rate=lr,
    ...
)
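The specific values are elided above. A sketch of what a fuller configuration might look like, where the learning rate, batch size, and epoch count are illustrative assumptions rather than the author’s values (note that eval_strategy was called evaluation_strategy in older transformers releases):
# sketch: illustrative hyperparameters and a fuller TrainingArguments setup
lr = 2e-4          # assumed value
batch_size = 8     # assumed value
num_epochs = 10    # assumed value

training_args = TrainingArguments(
    output_dir="bert-phishing-classifier_teacher",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    logging_strategy="epoch",
    eval_strategy="epoch",   # "evaluation_strategy" in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
)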
Now, we pass the training arguments into the Trainer class and kick off the training process.
trainer = Trainer(
    model=model,
    args=training_args,
    ...
)

trainer.train()
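The elided arguments would typically be the tokenized datasets, the tokenizer, the data collator, and the metric function defined earlier. A sketch, with the train/test split names assumed:
# sketch: a fuller Trainer setup wiring in the objects defined earlier (split names assumed)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()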
Once trained, we evaluate the model’s performance on independent validation data.
# apply model to validation dataset
predictions = trainer.predict(tokenized_data["validation"])

# Extract the logits and labels from the predictions object
...
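One way to finish this step is to reuse the compute_metrics helper from above; the .predictions and .label_ids attributes come from the PredictionOutput object that trainer.predict returns:
# sketch: pull logits and labels out of the PredictionOutput and score them
logits = predictions.predictions
labels = predictions.label_ids

metrics = compute_metrics((logits, labels))
print(metrics)  # e.g. {"Accuracy": ..., "AUC": ...}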
Congratulations! Your model has been fine-tuned and evaluated with promising results. Now, you’re ready to deploy it for real-world applications.