Concerns about the risks that AI models pose to marginalized groups are well documented, particularly how harmful connotations in generated content can reinforce societal stereotypes. For instance, representing demographic groups as animals or associating them with negative concepts can perpetuate long-standing negative narratives about those groups [4].
Generative AI models have the potential to learn and replicate these problematic associations if not carefully monitored [4]. This raises important questions about the ethical implications of deploying such models in various applications.
To address these concerns, large language models (LLMs) can be fine-tuned using several strategies [6]. Fine-tuning typically involves two phases: Supervised Fine-Tuning (SFT) to establish a base model, followed by Reinforcement Learning from Human Feedback (RLHF) for further improvement. SFT trains the model to imitate high-quality demonstration data, helping it learn to generate more appropriate and sensitive content, while RLHF refines the model through preference feedback [6].
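As a rough illustration of the SFT phase, the sketch below fine-tunes a causal language model to imitate a few curated demonstration pairs. The checkpoint name, hyperparameters, and the tiny demonstration list are placeholders for illustration, not details taken from the cited work.

```python
# Minimal SFT sketch: train the model to imitate curated (prompt, response) demonstrations.
# The checkpoint and the demonstration data below are illustrative placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

demonstrations = [
    {"prompt": "Describe a software engineer.",
     "response": "A software engineer designs, builds, and tests programs."},
    # ... more curated, bias-reviewed demonstrations
]

def collate(batch):
    # Concatenate prompt and demonstration; labels equal the inputs, so the model
    # learns to reproduce the curated response via next-token cross-entropy.
    texts = [ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    enc["labels"] = enc["input_ids"].clone()
    enc["labels"][enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    return enc

loader = DataLoader(demonstrations, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(1):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```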
RLHF comes in two main flavors: reward-based and reward-free. The former trains a reward model to guide a reinforcement learning algorithm, while the latter trains the model directly on preference data. Reward-free methods such as Direct Preference Optimization (DPO) have shown promising results in steering models away from problematic depictions [4].
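The core of DPO can be written as a single loss over preference pairs. The sketch below assumes we already have per-sequence log-probabilities of the preferred (chosen) and dispreferred (rejected) completions under the policy being trained and under a frozen reference model; the variable names, the beta value, and the toy numbers are illustrative only.

```python
# DPO loss sketch over a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the trained policy against the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Widen the margin between chosen and rejected log-ratios, steering the policy
    # toward preferred (e.g. non-stereotyping) outputs without an explicit reward model.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities for two preference pairs.
loss = dpo_loss(torch.tensor([-12.3, -8.1]), torch.tensor([-11.0, -9.4]),
                torch.tensor([-12.0, -8.5]), torch.tensor([-10.5, -9.0]))
print(loss)
```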
Beyond fine-tuning, mitigation strategies applied at deployment time are crucial to ensure that the model operates ethically and responsibly; they cover both user input prompts and the final generated images.
Prompt Filtering
One key aspect is prompt filtering, where harmful or inappropriate user requests are identified and blocked before processing. Methods like keyword matching and embedding-based CNN filters can help detect harmful content and prevent its generation [4].
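As a hedged sketch of such a pre-generation filter, the snippet below combines exact keyword matching with an embedding-similarity check against example harmful prompts. The blocklist, exemplar prompts, encoder checkpoint, and threshold are placeholder choices, and the CNN classifier mentioned above is approximated here by a simple nearest-exemplar cosine check.

```python
# Prompt pre-screening sketch: keyword match plus embedding similarity to known-harmful examples.
from sentence_transformers import SentenceTransformer, util

BLOCKED_KEYWORDS = {"<explicit slur placeholder>", "<graphic violence placeholder>"}
HARMFUL_EXAMPLES = ["depict <group> as animals",
                    "generate a violent scene targeting <group>"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder choice
harmful_embeddings = encoder.encode(HARMFUL_EXAMPLES, convert_to_tensor=True)

def is_blocked(prompt: str, threshold: float = 0.75) -> bool:
    lowered = prompt.lower()
    # 1) Cheap keyword match catches explicit requests.
    if any(term in lowered for term in BLOCKED_KEYWORDS):
        return True
    # 2) Embedding similarity catches paraphrases that avoid exact keywords.
    emb = encoder.encode(prompt, convert_to_tensor=True)
    return bool(util.cos_sim(emb, harmful_embeddings).max() >= threshold)

print(is_blocked("a cheerful picnic in the park"))  # expected: False
```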
LLMs stand out for their ability to understand context and intent, which makes them well suited to filtering harmful content. By training or instructing LLMs to recognize and block specific types of content, organizations can support more responsible use of AI technology.
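One hypothetical way to use an LLM as a prompt moderator is simply to ask it whether a request should be allowed. In the sketch below, the OpenAI-style client, the model name, and the policy wording are assumptions made for illustration, not details drawn from the cited work.

```python
# LLM-based prompt moderation sketch: the LLM judges intent before the image model runs.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODERATION_INSTRUCTIONS = (
    "You are a content-safety reviewer for a text-to-image system. "
    "Answer with exactly one word, ALLOW or BLOCK. "
    "BLOCK prompts that demean, dehumanize, or stereotype a demographic group."
)

def llm_allows(prompt: str) -> bool:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content": MODERATION_INSTRUCTIONS},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper() == "ALLOW"
```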
Prompt Manipulations
Before generating images based on user prompts, various manipulations can be applied to enhance safety and reduce stereotypes. Prompt augmentation and anonymization techniques can help diversify results and protect user privacy [5].
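The toy sketch below illustrates both ideas under strong simplifying assumptions: an unspecified attribute is sampled when a person is requested without qualification, and a crude regular expression stands in for a proper named-entity-recognition step when anonymizing.

```python
# Prompt augmentation and anonymization sketch; attribute list and patterns are placeholders.
import random
import re

ATTRIBUTES = ["of any gender", "of diverse ages", "from varied cultural backgrounds"]

def augment(prompt: str) -> str:
    # Diversify only when a person is requested without further qualification.
    if re.search(r"\b(person|doctor|engineer|teacher)\b", prompt, re.IGNORECASE):
        return f"{prompt}, {random.choice(ATTRIBUTES)}"
    return prompt

def anonymize(prompt: str) -> str:
    # Crude stand-in for NER: mask capitalized name-like token pairs.
    return re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "a person", prompt)

print(augment("a portrait of a doctor"))
print(anonymize("a photo of Jane Doe at her desk"))
```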
By rewriting or grounding prompts, organizations can transform potentially harmful requests into neutral or positive ones, mitigating biases and stereotypes in the output images [5].
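A rewriting step might look like the hypothetical sketch below, which uses the same assumed OpenAI-style client as the moderation example to restate a prompt in neutral, non-stereotyping terms before it reaches the image model.

```python
# Prompt rewriting sketch: restate a borderline prompt instead of refusing it outright.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_INSTRUCTIONS = (
    "Rewrite the user's image prompt so it keeps the benign creative intent but removes "
    "any demeaning framing, stereotypes, or identifying personal details. "
    "Return only the rewritten prompt."
)

def rewrite_prompt(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTIONS},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return reply.choices[0].message.content.strip()
```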
Output Image Classifiers
Deploying image classifiers can help identify and block harmful images generated by AI models. Multimodal classifiers that consider input images, prompts, and outputs can offer a more holistic approach to detecting unsafe transformations and unintended consequences [4].
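A minimal sketch of such a classifier, assuming CLIP is used as a shared encoder: the prompt and the generated image are embedded, concatenated, and scored by a small safety head. The head is untrained here and would in practice be fit on labeled safe and unsafe (prompt, image) pairs; an input image from an editing workflow could be embedded and concatenated in the same way.

```python
# Multimodal output-classifier sketch built on CLIP embeddings of prompt and image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # placeholder checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
safety_head = torch.nn.Linear(clip.config.projection_dim * 2, 1)  # needs training on labeled pairs

@torch.no_grad()
def unsafe_score(prompt: str, image: Image.Image) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    joint = torch.cat([text_emb, image_emb], dim=-1)
    return torch.sigmoid(safety_head(joint)).item()  # probability-like unsafe score
```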
Regeneration Instead of Refusals
Models like DALL·E 3 use a unique algorithm based on classifier guidance to steer away from unsolicited harmful content, nudging the model towards more appropriate and safer generations instead of simply refusing the request [3]. This approach refines behavior at both the prompt level and the image-classifier level to ensure responsible output.
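The sketch below is not the DALL·E 3 mechanism described in [3]; it only illustrates the general idea of regenerating instead of refusing. It reuses the hypothetical helpers from the earlier sketches (the prompt rewriter and the output classifier) plus an assumed generate_image call for the text-to-image model.

```python
# Regeneration-instead-of-refusal sketch; generate_image, rewrite_prompt, and unsafe_score
# are the hypothetical helpers sketched in the previous sections.
def generate_safely(prompt: str, max_attempts: int = 3, threshold: float = 0.5):
    for _ in range(max_attempts):
        image = generate_image(prompt)           # assumed text-to-image call
        if unsafe_score(prompt, image) < threshold:
            return image                         # safe enough: return instead of refusing
        prompt = rewrite_prompt(prompt)          # steer the next attempt toward safer output
    return None                                  # fall back to refusal only as a last resort
```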