Using open web data to assess and improve non-binary gender operationalization at scale

Project description

Problem statement: Although there is a scientific consensus that gender is neither binary nor purely physiological, binary categories remain common, even in research practice. The gender identities and expressions of transgender and gender non-conforming people are often not recognized and are masked behind accepted binary categorizations. Structuring and standardizing data inevitably brings methodological challenges and some loss of context and meaning, but it is critical to use inclusive measures that can support work on societal change.

Project within the Fellowship: This feasibility study joins emerging efforts to establish inclusive gender operationalization by exploring current self-representations of gender in open web resources such as Wikipedia profiles and personal web pages. The study focuses on the film industry and stems from a larger research project (“Zirkulation,” funded by the German Ministry of Education and Research) that explores the circulation of independent films. Efforts to address gender inequality in the film industry have gained some momentum in recent years, yet most studies are still limited to binary operationalizations. The study uses a sample of the directors and cast of 1727 films drawn from a sample of six relevant international film festivals. This case study exemplifies a common situation in which using a survey instrument to collect high-quality self-reported data on gender is very difficult because of financial, time, and practical constraints. Researchers therefore often fall back on automatic detection methods that are primarily based on names and are explicitly limited to binary categories. In contrast to these traditional methods, this project focuses on organic web data showing how people choose to represent their gender, for example their use of personal pronouns and various cues of self-identification with a non-binary gender. Instead of assigning a gender category top-down, the approach is to identify and annotate the language used to represent gender identities. These findings are then compared with name-based manual and automatic (Gender Guesser in Python and Genderize in R) assignments of gender.
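The cue-based approach described above can be illustrated with a minimal sketch. It is not the project's annotation scheme: the cue lexicon, the list of identity terms, and the function names `annotate_cues` and `flag_for_review` are all hypothetical, and the name-based label is assumed to be a simple string such as "female", "male", or "unknown" supplied by a separate name-based tool. The sketch only counts pronoun cues in a short biography text and flags cases where those cues diverge from the name-based label, so that a person can review them manually.

```python
import re
from collections import Counter

# Hypothetical cue lexicon for illustration only; the project's actual
# annotation categories are not specified in the text.
PRONOUN_CUES = {
    "she/her": {"she", "her", "hers", "herself"},
    "he/him": {"he", "him", "his", "himself"},
    "they/them": {"they", "them", "their", "theirs", "themself", "themselves"},
}
# Hypothetical list of self-identification terms (assumption).
IDENTITY_TERMS = ("non-binary", "nonbinary", "genderqueer", "genderfluid", "agender")


def annotate_cues(text):
    """Collect pronoun counts and identity terms found in a text."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z]+", lowered)
    counts = Counter()
    for token in tokens:
        for label, forms in PRONOUN_CUES.items():
            if token in forms:
                counts[label] += 1
    # Plain substring search; "they" can also be a plural pronoun, so
    # these cues are candidates for manual annotation, not final labels.
    terms = [term for term in IDENTITY_TERMS if term in lowered]
    return {"pronoun_counts": dict(counts), "identity_terms": terms}


def flag_for_review(cues, name_based_label):
    """Return True when text cues diverge from a name-based binary label."""
    expected = {"female": "she/her", "male": "he/him"}.get(name_based_label)
    if cues["identity_terms"]:
        return True
    return any(label != expected for label in cues["pronoun_counts"])


if __name__ == "__main__":
    # Invented example text, not taken from the project's data.
    bio = "Alex Doe is a non-binary filmmaker. They premiered their first feature in 2019."
    cues = annotate_cues(bio)
    print(cues)
    print(flag_for_review(cues, "male"))
```

The output of such a step is deliberately a list of observed cues rather than an assigned category, which keeps the final interpretation with a human annotator.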

The manual annotations of language cues produced in the project can later be used to improve automatic detection methods and to inform how gender data are standardized and structured in open linked data projects (e.g., Wikidata).

Interim Report (Zwischenbericht)

Author

Name: Zhenya (Evgenia) Samoilova, PhD
Institution: Film University Babelsberg Konrad Wolf
Contact: e.samoilova@filmuniversitaet.de