In The Space
Research and white papers
There are hundreds of smart people focused on issues related to AI trust and safety. The following is a sample of the great work being done.
Model Development
General
Liquid Foundation Models: Our First Series of Generative AI Models: Could Liquid Foundation Models (LFMs) replace Transformer-based models?
Building Socio-culturally Inclusive Stereotype Resources with Community Engagement
All that Agrees Is Not Gold: Evaluating Ground Truth Labels and Dialogue Content for Safety
A Framework to Assess (Dis)agreement Among Diverse Rater Groups
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
An Insider’s Guide to Designing and Operationalizing a Responsible AI Governance Framework
The History and Risks of Reinforcement Learning and Human Feedback