Explainable AI (XAI) aims to make AI/ML decisions more transparent and understandable. In this project, I focus on a practical application of model explainability in the context of hate speech detection. This approach can be extended to other tasks as well.
The primary goal is to classify whether a given sentence is hate speech or offensive. While classification alone can provide an answer, it often falls short of explaining the reasoning behind the decision. To bridge this gap, I have implemented a few methods to generate explanations for the classification results:
1) Token Classification Task: This approach identifies specific hate speech words within the sentence. By highlighting these words, users can see which elements of the sentence contributed to the classification (a minimal sketch follows this list).
2) Local Interpretable Model-agnostic Explanations (LIME): LIME approximates the model locally with an interpretable model, offering insights into which features (words or phrases) are influential in the model's decision-making process.
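To make the first approach concrete, here is a minimal sketch of running a fine-tuned token-classification (NER) model through the HuggingFace pipeline. The checkpoint path is a placeholder, not the actual model name.

```python
# Minimal sketch of the token-classification approach (approach 1 above).
# The model path is a hypothetical placeholder for the fine-tuned NER checkpoint.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/hate-speech-ner",   # hypothetical fine-tuned checkpoint
    aggregation_strategy="simple",     # merge sub-word tokens into whole words
)

sentence = "an example sentence to analyse"
for entity in ner(sentence):
    # each entity carries the flagged span, its label, and a confidence score
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```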
The models that I have used are trained on publicly available data.
The models were trained on a single A10 GPU and uploaded to HuggingFace; their accuracy can still be improved. A weighted loss was used to train the NER model, while the classification models (BERT and RF) used standard cross-entropy loss.
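For reference, a weighted cross-entropy can be set up in PyTorch as below; the label scheme and weight values here are illustrative, not the exact ones used in training.

```python
# A minimal sketch of a weighted loss, assuming PyTorch and three labels
# (e.g. O / hate / offensive); the weights are illustrative assumptions.
import torch
import torch.nn as nn

class_weights = torch.tensor([0.2, 1.0, 1.0])  # down-weight the dominant label
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 3)           # (batch, num_labels) model outputs
labels = torch.randint(0, 3, (8,))   # ground-truth label ids
loss = loss_fn(logits, labels)
```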
I still believe there is room for improvement in these models merely by tuning hyperparameters. Let’s dive in!
When it comes to model explainability, feature-based models are often easier to visualize and understand. Simple models, such as Naive Bayes classifiers and Random Forests, can effectively perform classification or regression tasks. I trained a Random Forest Classifier to categorize input text into three categories: Normal, Hate Speech, and Offensive. Using TF-IDF to vectorize the input, the model achieved reasonable accuracy.
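A minimal sketch of this TF-IDF + Random Forest setup with scikit-learn is shown below; the placeholder data and hyperparameters are illustrative assumptions, not the actual training configuration.

```python
# Sketch of the TF-IDF + Random Forest classifier, assuming scikit-learn and
# a dataset of (text, label) pairs with labels {normal, hatespeech, offensive}.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = ["example normal sentence", "example offensive sentence"]  # placeholder data
labels = ["normal", "offensive"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),        # word and bigram features
    RandomForestClassifier(n_estimators=200),   # illustrative hyperparameter
)
clf.fit(texts, labels)
print(clf.predict_proba(["some new sentence"]))  # class probabilities
```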
However, the main drawback of these simpler models is their limited understanding of context. Feature-based models naively interpret context and can be easily misled. They struggle to grasp complex contextual nuances, which is why there is a need for more sophisticated models, like transformers. Transformers leverage advanced techniques such as self-attention and contextual embeddings to better understand and process intricate contextual information.
For one test sentence, the Random Forest model assigns a 62% probability to the "hatespeech" category, 30% to "offensive," and 9% to "normal," indicating it recognizes the sentence as problematic but lacks full confidence in identifying it as hate speech. Notably, the model identifies words like "ching" and "chong" as contributing to the "NOT normal" category, yet it underestimates their significance. This highlights the model's naive interpretation of the context.
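Word-level attributions like these can be produced by wrapping LIME's text explainer around the scikit-learn pipeline; the class ordering, variable names, and example sentence below are assumptions for illustration.

```python
# Sketch of LIME attributions for the Random Forest pipeline above;
# class names and the input sentence are illustrative placeholders.
from lime.lime_text import LimeTextExplainer

class_names = ["normal", "hatespeech", "offensive"]
explainer = LimeTextExplainer(class_names=class_names)

sentence = "an input sentence to explain"
exp = explainer.explain_instance(
    sentence,
    clf.predict_proba,   # the scikit-learn pipeline defined earlier
    num_features=10,     # top contributing words
    labels=[0, 1, 2],
)
print(exp.as_list(label=1))  # (word, weight) pairs for the "hatespeech" class
```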
I trained BERT for the classification task. Identifying hate speech words requires a deep understanding of the sentence context. BERT excels at this due to its positional encoding and self-attention mechanisms, which allow it to grasp the context of the sentence more effectively and provide coherent explanations. The Random Forest model, by contrast, can sometimes generate nonsensical explanations.
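A minimal inference sketch for the fine-tuned BERT classifier is below; the checkpoint path and the label ordering are placeholders, not the actual model configuration.

```python
# Inference sketch for the BERT classifier; the checkpoint path is a
# hypothetical placeholder for the fine-tuned model uploaded to HuggingFace.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "path/to/bert-hate-speech-classifier"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("an example sentence", return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# label order is an assumption; it depends on the model's id2label mapping
print(dict(zip(["normal", "hatespeech", "offensive"], probs[0].tolist())))
```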
This highlights the difference in how these models understand context. The Random Forest model interprets context in a more naive manner compared to the BERT model, which utilizes advanced techniques to capture complex contextual relationships.