From project data to language corpus: an online guide to opening linguistic recordings for further analysis

Abstract

This project aims to create a practical resource that will enable linguists and other researchers who gather speech data, such as recordings of stories, monologues, and conversations, to make their project data open and thus reusable for further analysis. Speech recordings are rich in information – natural language combines sound, sentence structure and meaning in complex ways – yet researchers cannot examine all issues of interest in the course of a single project. At the same time, the collection and annotation of language data is time-consuming and costly, and if data remain local and project-bound, research potential goes untapped.

Providing a step-by-step guide to creating freely accessible linguistic corpora will make it easier for researchers to open their datasets to be used to their fullest potential.

The planned guide will cover the essentials of the process from start to finish: participant consent forms, technical implementation and annotation issues. To accompany the guide, a fully functional sample corpus of authentic speech data will be created to illustrate individual steps and challenges in the process. The materials will be made available online and on Wikiversity, so that they can be used by other researchers, in formal education settings and for self-study by the general public.

Project description

In linguistic research, a large amount of speech data is often collected to answer a very specific question. Such data include acoustic recordings of speech, but also data on processes related to speaking, such as breathing cycles or movements of the speech articulators (e.g., tongue, lips). Many of these datasets are excellent sources for secondary analysis. In other words, they could be analyzed in light of many further research questions in other areas of linguistics: for example, while some linguists are interested in understanding the particular acoustic cues that signal a certain meaning in a conversation, others want to know how sentence structure gives rise to meaning, and still others investigate language use among certain groups of people. In each of these cases, researchers collect a wealth of data, yet its full potential is rarely exploited: the datasets often remain project-specific and local to the researcher or research group.

The goal of this project is thus to produce a practical guide to making such data open and available to other researchers who wish to investigate further aspects. In the course of my PhD project, I am creating a collection of recordings of acoustic, respiratory and motion capture data to better understand how people adapt their speech behavior to physical activity. This entails recording spontaneous speech that lends itself to further analysis beyond the scope of my research question. During the fellowship, I would like to develop a practical hands-on guide to making project-based data usable for a broader audience.

The guide will cover the essential points of making language recordings open, including ethics and participant consent, issues related to conducting secondary analysis, annotation techniques and the use of open-source software for analysis. The guide will be freely available online and will be illustrated by examples from the dataset. To promote awareness and generate feedback, a one-day workshop will be held in Berlin, and the resulting guide and sample corpus will be made available on Wikiversity.

Interim report

This report describes the status of the project as of January 2020.

Author

Name: Heather Weston
Institution: Leibniz-Zentrum Allgemeine Sprachwissenschaft, Berlin
Contact: weston@leibniz-zas.de