To Spot Toxic Speech Online, Try AI

A new tool helps balance accuracy with fairness toward all groups in social media

Based on the research of Maria De-Arteaga

Earlier this year, Facebook rolled back rules against some hate speech and abuse. Along with changes at X (formerly Twitter) that followed its purchase by Elon Musk, the shifts make it harder for social media users to avoid encountering toxic speech.

That doesn’t mean that social networks and other online spaces have given up on the massive challenge of moderating content to protect users. One novel approach relies on artificial intelligence. AI screening tools can analyze content on large scales while sparing human screeners the trauma of constant exposure to toxic speech.

But AI content moderation faces a challenge, says Maria De-Arteaga, assistant professor of information, risk, and operations management at Texas McCombs: being fair as well as being accurate. An algorithm may be accurate at detecting toxic speech overall, but it may not detect it equally well across all groups of people and all social contexts.

“If I just look at overall performance, I may say, oh, this model is performing really well, even though it may always be giving me the wrong answer for a small group,” she says. For example, it might better detect speech that’s offensive to one ethnic group than to another.

In new research, De-Arteaga and her co-authors show it’s possible to achieve high levels of both accuracy and fairness. What’s more, they devise an algorithm that helps stakeholders balance both, finding desirable combinations of accuracy and fairness for their particular situations.
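
The idea of a trade-off can be sketched in a few lines of code. The snippet below is an illustrative Python example, not the authors' implementation: it assumes a handful of hypothetical candidate models, each already scored for overall accuracy and for some fairness measure, and keeps only the Pareto-optimal ones, meaning those that no other candidate beats on both criteria at once. Stakeholders would then choose among the survivors according to their own priorities.

```python
# Illustrative sketch: selecting Pareto-optimal models on accuracy vs. fairness.
# The candidate models and their scores are hypothetical, for demonstration only.

candidates = {
    "model_a": {"accuracy": 0.91, "fairness": 0.80},
    "model_b": {"accuracy": 0.89, "fairness": 0.88},
    "model_c": {"accuracy": 0.93, "fairness": 0.75},
    "model_d": {"accuracy": 0.88, "fairness": 0.86},  # dominated by model_b
}

def is_dominated(name, scores):
    """A model is dominated if some other model is at least as good on both
    criteria and strictly better on at least one."""
    acc, fair = scores[name]["accuracy"], scores[name]["fairness"]
    for other, s in scores.items():
        if other == name:
            continue
        if s["accuracy"] >= acc and s["fairness"] >= fair and (
            s["accuracy"] > acc or s["fairness"] > fair
        ):
            return True
    return False

pareto_front = {n: s for n, s in candidates.items() if not is_dominated(n, candidates)}
print(pareto_front)  # model_a, model_b, and model_c survive; model_d is dominated
```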

With professor Matthew Lease and graduate students Soumyajit Gupta and Anubrata Das of UT’s School of Information, as well as Venelin Kovatchev of the University of Birmingham, United Kingdom, De-Arteaga worked with datasets of social media posts that previous researchers had already labeled “toxic” or “nontoxic” (safe). The sets totaled 114,000 posts.

The researchers used a fairness measure called Group Accuracy Parity (GAP), along with formulas that train a machine learning model to balance fairness with accuracy. When they applied their approach to the datasets:

  • It performed up to 1.5% better than the next-best approaches for treating all groups fairly.
  • It performed the best at maximizing both fairness and accuracy at the same time.
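
To give a rough sense of what a group-accuracy-parity-style measure captures, the Python sketch below compares a classifier’s accuracy on posts about each group and reports the largest gap between groups. It is a simplified stand-in rather than the exact formula from the paper, and the posts, groups, and predictions in it are invented.

```python
# Simplified illustration of a group-accuracy-parity-style check.
# Data, groups, and predictions are invented; this is not the paper's exact GAP formula.
from itertools import combinations

# Each record: (group the post concerns, true label, model's predicted label)
posts = [
    ("group_1", "toxic", "toxic"),
    ("group_1", "nontoxic", "nontoxic"),
    ("group_1", "toxic", "nontoxic"),   # missed toxic post
    ("group_2", "toxic", "toxic"),
    ("group_2", "nontoxic", "nontoxic"),
    ("group_2", "nontoxic", "nontoxic"),
]

def accuracy_by_group(records):
    """Fraction of correct predictions, computed separately for each group."""
    totals, correct = {}, {}
    for group, truth, pred in records:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (truth == pred)
    return {g: correct[g] / totals[g] for g in totals}

per_group = accuracy_by_group(posts)
# Largest accuracy gap between any two groups: 0 means every group is served equally well.
worst_gap = max(abs(per_group[a] - per_group[b]) for a, b in combinations(per_group, 2))
print(per_group)   # {'group_1': 0.666..., 'group_2': 1.0}
print(worst_gap)   # 0.333...: the model detects toxicity less reliably for group_1
```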

But GAP is not a one-size-fits-all solution for fairness, De-Arteaga notes. Different measures of fairness may be relevant for different stakeholders. The kinds of data needed to train the systems depend partly on the specific groups and contexts in which they’re applied.

For example, different groups may have different opinions on what speech is toxic. In addition, standards on toxic speech can evolve over time.

Getting such nuances wrong could unjustly remove someone from a social space by mislabeling nontoxic speech as toxic. At the other extreme, missteps could expose more people to hateful speech.

The challenge is compounded for platforms like Facebook and X, which have a global presence and serve a wide spectrum of users.

“How do you incorporate fairness considerations in the design of the data and the algorithm in a way that is not just centered on what is relevant in the U.S.?” De-Arteaga says.

For that reason, the algorithms may require continual updating, and designers may need to adapt them to the circumstances and kinds of content they’re moderating, she says. To facilitate that, the researchers have made GAP’s code publicly available.

High levels of both fairness and accuracy are achievable, De-Arteaga says, if designers pay attention to both technical and cultural contexts.

“You need to care, and you need to have knowledge that is interdisciplinary,” she says. “You really need to take those considerations into account.”

“Finding Pareto Trade-Offs in Fair and Accurate Detection of Toxic Speech” is published in Information Research.

Story by Omar Gallaga