By Xavier Ferrer
Research Associate at King's College London, Informatics
and Tom Van Nuenen
Research Associate at King's College London, Informatics

Updated Aug 2020: The approach used to discover the biases presented in this tool is described in the paper "Discovering and Categorising Language Biases in Reddit", accepted at the forthcoming International AAAI Conference on Web and Social Media (ICWSM 2021). If you are interested in training your own embedding models to discover biases, we share our code on GitHub.

Updated May 2021: A new version of the methodology to discover biases in language communities is described in the paper "Discovering and Interpreting Conceptual Biases in Online Communities". We have also developed a new Language Bias Visualiser with new options and analyses!

Language Bias Visualiser

The DADD (Discovering and Attesting Digital Discrimination) Language Bias Visualiser is a tool for interactively comparing stereotypes about men and women inherent in large textual datasets taken from the internet, as captured by word embedding models.

Language carries implicit biases, functioning both as a reflection and a perpetuation of the stereotypes that people carry with them. Recently, advances in Artificial Intelligence have made it possible to use machine learning techniques to trace linguistic biases. One of the most promising approaches in this field involves word embeddings, which transform text into high-dimensional vectors that capture semantic relations between words, and which have been successfully used to quantify human biases in large textual datasets. Target concepts such as `men' or `women' are connected to evaluative attributes found in the data; these attributes are then grouped through clustering algorithms and labelled by a semantic analysis system into more general (conceptual) biases. Categorising biases in this way allows us to give a broad picture of the biases present in a community.
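
To make this concrete, here is a minimal sketch of how a per-word gender bias score can be computed from a gensim word2vec model: a word's bias is the difference between its average cosine similarity to a women target set and to a men target set. The model file name and the target sets are illustrative, not the exact configuration used in the papers above.

    from gensim.models import Word2Vec

    WOMEN = ["she", "her", "woman", "women", "girl"]
    MEN = ["he", "him", "man", "men", "boy"]

    # Hypothetical word2vec model trained on the selected community's comments.
    wv = Word2Vec.load("dating_advice.model").wv

    def gender_bias(word):
        """Positive -> closer to the women target set; negative -> closer to the men set."""
        to_women = sum(wv.similarity(word, t) for t in WOMEN) / len(WOMEN)
        to_men = sum(wv.similarity(word, t) for t in MEN) / len(MEN)
        return to_women - to_men

    # Rank the rest of the vocabulary from most women-biased to most men-biased.
    targets = set(WOMEN + MEN)
    ranked = sorted((w for w in wv.key_to_index if w not in targets),
                    key=gender_bias, reverse=True)
    print(ranked[:20])   # most women-biased words
    print(ranked[-20:])  # most men-biased words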


Dataset Selection

Find a link to the online community below:
https://reddit.com/r/dating_advice



Most Frequent Gender-Biased Words

The word clouds presented below show the most frequent words biased towards women (left) and men (right) in the selected dataset, that is, the words found more often in women- and men-related contexts. The size and color of each word correspond to its frequency: bigger means more frequent. For details about each word, see the section Detailed Dataset Word Biases.
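
As a rough illustration of how such a cloud is drawn, the sketch below uses the wordcloud package on a word-to-frequency mapping; the counts are toy values, not figures from the dataset.

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    # Toy frequency counts; the visualiser derives these from the Reddit comments.
    women_freqs = {"beautiful": 120, "emotional": 85, "mother": 60, "dress": 40}

    cloud = WordCloud(width=600, height=400, background_color="white")
    cloud.generate_from_frequencies(women_freqs)

    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()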


[Word clouds: women-biased words (left) and men-biased words (right)]


Detailed Dataset Word Biases

This section details the most frequent gender-biased words for women (left) and men (right) in the dataset, listing each word's bias strength, frequency, sentiment, and part of speech.


Women

# | Word | Bias | Freq | Sent | POS

Men

# | Word | Bias | Freq | Sent | POS
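
For illustration, one row of these tables could be assembled as in the sketch below, using VADER for sentiment and NLTK for part-of-speech tagging; the actual tools behind the visualiser may differ, and the word, frequency, and bias values are placeholders.

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sia = SentimentIntensityAnalyzer()

    def table_row(word, freq, bias):
        sent = sia.polarity_scores(word)["compound"]  # -1 (negative) .. 1 (positive)
        pos = nltk.pos_tag([word])[0][1]              # Penn Treebank tag, e.g. 'JJ'
        return {"Word": word, "Bias": round(bias, 3), "Freq": freq,
                "Sent": round(sent, 2), "POS": pos}

    print(table_row("beautiful", 120, 0.21))  # placeholder frequency and bias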

Word Distributions of Biases

Explore the bias and frequency distributions of all gender-biased words in the dataset in two bar plots; women-biased words are shown on the left bar plot and men-biased words on the right. By comparing the distributions, one can observe differences between genders in the dataset. For instance, in The Red Pill, although men-biased words are more frequent, women-biased words carry stronger gender biases.
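
A minimal sketch of such a side-by-side comparison with matplotlib, using toy bias scores rather than values from any dataset:

    import matplotlib.pyplot as plt

    women_bias = [0.12, 0.25, 0.31, 0.08, 0.19]  # toy per-word bias scores
    men_bias = [0.05, 0.14, 0.22, 0.11, 0.09]

    fig, (ax_w, ax_m) = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
    ax_w.hist(women_bias, bins=10, color="tab:purple")
    ax_w.set_title("Women-biased words")
    ax_w.set_ylabel("number of words")
    ax_m.hist(men_bias, bins=10, color="tab:blue")
    ax_m.set_title("Men-biased words")
    for ax in (ax_w, ax_m):
        ax.set_xlabel("bias strength")
    plt.tight_layout()
    plt.show()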



Bias Polarity

Explore the sentiment of the most gender-biased words for women (left) and men (right), classified into 7 categories ranging from positive to negative.
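
The exact category boundaries are not listed on this page; as an assumption, the sketch below bins a VADER-style compound score in [-1, 1] into 7 equal-width categories.

    LABELS = ["very negative", "negative", "slightly negative", "neutral",
              "slightly positive", "positive", "very positive"]

    def polarity_category(compound):
        # compound is in [-1, 1]; split that range into 7 equal-width bins
        idx = min(int((compound + 1) / 2 * 7), 6)
        return LABELS[idx]

    print(polarity_category(-0.8))  # "very negative"
    print(polarity_category(0.05))  # "neutral"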


Word Embedding Space

Explore the distribution of women- and men-biased words in the embedding space learned by the machine learning algorithm, represented in the two principal t-SNE dimensions.
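
A minimal sketch of this kind of projection with scikit-learn's t-SNE, assuming a hypothetical gensim model file and an illustrative word list:

    import numpy as np
    import matplotlib.pyplot as plt
    from gensim.models import Word2Vec
    from sklearn.manifold import TSNE

    wv = Word2Vec.load("dating_advice.model").wv  # hypothetical model file

    words = ["beautiful", "emotional", "strong", "logical", "gym", "career"]
    vectors = np.array([wv[w] for w in words])

    # Project the high-dimensional word vectors onto two t-SNE dimensions.
    coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), w in zip(coords, words):
        plt.annotate(w, (x, y))
    plt.show()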



Concept Embedding Space

The figure below shows the distribution of women and men concepts in the embedding space for the selected dataset, presented in the two most informative t-SNE dimensions.
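
One plausible way to obtain concept-level points, sketched under the assumption that each concept is the centroid of the word vectors in one cluster of biased words (the clusters and model file below are illustrative):

    import numpy as np
    from gensim.models import Word2Vec

    wv = Word2Vec.load("dating_advice.model").wv  # hypothetical model file

    # Illustrative clusters; the real ones come from the clustering step above.
    clusters = {
        "appearance": ["beautiful", "pretty", "cute"],
        "intellect": ["smart", "logical", "rational"],
    }

    # One vector per concept: the centroid of its word vectors. These can then
    # be projected with the same t-SNE step used for individual words.
    concept_vectors = {name: np.mean([wv[w] for w in words], axis=0)
                       for name, words in clusters.items()}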
