Title: "Pitch, resonance and their contributions to listener perceptions of speaker gender"
Supervisors: Prof. Jennifer Oates (La Trobe University), Dr Viktoria Papp (University of Canterbury), Dr Siew Pang Chan (National University of Singapore).
Abstract: Transsexual individuals frequently display voice and communication characteristics which are incongruent with their experienced gender. This often leads to reduced participation in everyday life and reduced quality of life (Hancock, Krissinger & Owen, 2011). Voice modifications via hormones, surgery or voice training can be effective in creating communicative characteristics congruent with the desired gender (McNeill, 2006). This can alleviate a transsexual individual’s gender dysphoria, which is the feeling of distress resulting from the marked difference between their experienced and birth-assigned gender (T’Sjoen, 2013).
The aim of voice training is to help the individual adopt communication characteristics congruent with their self-identified gender, thus alleviating the dysphoria. Goal setting for voice therapy is based on the aspects of speech and voice most salient in listener perceptions of speaker gender (Oates, 2012). Leung, Oates and Chan (2018) systematically reviewed the literature on such aspects and found that significant risks of bias exist in the body of research on contributors to gender perception, rendering conclusions about each aspect’s contribution unclear.
In the proposed studies, aspects of speaking fundamental frequency and formant characteristics, which were identified in the review above, are examined for their relative contributions to gender perception. Research will address the risks of bias by first developing normative values for speakers of Australian English on the two aspects and then establishing associations between these aspects and gender perception in exploratory and experimental research designs. The AusTalk corpus will be used as the main source of audio files of speakers of Australian English.
Title: "Studying Perceptual Dimensions of Non-experts by Assessing Speaker Differences"
Benjamin Weiss, Dominique Estival, Ulrike Stiefelhagen (2018). ‘Non-Experts' Perceptual Dimensions of Voice Assessed by Using Direct Comparisons’. Acta Acustica united with Acustica. Volume 104, Number 1, pp. 174-184(11). DOI: doi.org/10.3813/AAA.919157.
Abstract: In this study, three data sets of 13 speakers each are analyzed using the elicitation phase of the Repertory Grid Technique in order to identify perceptual dimensions of non-expert listeners. Sentences read by female and male speakers of German (Phondat 1 corpus) and by male Australian English speakers (AusTalk corpus) were rated on (dis)similarity by same-sex listeners using triples. Applying a balanced incomplete design proposed for the Repertory Grid Technique, frequencies of dissimilar pairs are transformed into distance measures using non-metrical multidimensional scaling. For both German data sets, three dimensions are found, and four for the Australian data. The dimensions describing the speaker differences are named ‘calmness’, ‘monotony’ and ‘naturalness’ for the German men; ‘tension’, ‘positive timbre’ and ‘proficiency’ for the German women; and ‘pitch’, ‘untypical timbre & voice’, ‘emotion’ and an unnamed fourth dimension for the Australian men. There are similarities between dimensions from the different data sets (‘calmness’ and ‘tension’; ‘naturalness’ and ‘proficiency’; ‘timbre’ and ‘timbre & voice’). Calmness and skill have also been found in a smaller, earlier experiment applying the same method for German speakers. Overall, the dimensions found are more complex and speaker-related than typically described in the literature. These results add to the current state of research in perceptual dimensions of non-experts and represent the foundation to develop a questionnaire for assessing listeners’ impressions.
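The step from triple judgements to a dissimilarity measure can be illustrated with a minimal sketch. It assumes that each judgement names one speaker in a triple as the odd one out, and that a pair counts as dissimilar whenever it contains that speaker; the abstract does not spell out these details, so the function name, the data layout and the normalisation by presentation count are all hypothetical. The sketch stops at the dissimilarity values and does not perform the non-metric multidimensional scaling itself.

```python
from collections import Counter
from itertools import combinations


def dissimilarity_matrix(judgements, speakers):
    """Turn odd-one-out judgements on triples into pairwise dissimilarities.

    judgements: list of (triple, odd_one_out) tuples (hypothetical layout).
    Returns a dict mapping each sorted speaker pair to the proportion of its
    presentations in which it was judged dissimilar, or None if never shown.
    """
    presented = Counter()   # how often each pair appeared in a triple
    dissimilar = Counter()  # how often each pair was judged dissimilar
    for triple, odd in judgements:
        if odd not in triple:
            raise ValueError(f"{odd!r} is not a member of {triple!r}")
        for pair in combinations(sorted(triple), 2):
            presented[pair] += 1
            # Assumption: a pair is dissimilar if it contains the odd one out.
            if odd in pair:
                dissimilar[pair] += 1
    matrix = {}
    for pair in combinations(sorted(speakers), 2):
        shown = presented[pair]
        matrix[pair] = dissimilar[pair] / shown if shown else None
    return matrix


# Illustrative judgements for four made-up speakers, not data from the study.
judgements = [
    (("s1", "s2", "s3"), "s3"),
    (("s1", "s2", "s4"), "s4"),
    (("s1", "s3", "s4"), "s1"),
    (("s2", "s3", "s4"), "s2"),
]
distances = dissimilarity_matrix(judgements, ["s1", "s2", "s3", "s4"])
for pair, value in distances.items():
    print(pair, value)
```

Values of this kind could then be passed to any non-metric multidimensional scaling routine that accepts a precomputed dissimilarity matrix.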
Title: "Conversational Australian English - Analysing Speech Acts in AusTalk Map Tasks"
Trans-Disciplinary Innovation Grant from the Centre of Excellence for the Dynamics of Language (2015-2016).
Abstract: This pilot project will identify new ways of mining large speech corpora for specific speech acts (SAs), such as questions, requests for information, and expressions of surprise or agreement/disagreement. It will add to our understanding of how specific SAs are expressed and to our knowledge of the special features of Australian English, especially in conversational speech. As one of the first explorations of the AusTalk corpus using the Alveo virtual laboratory, it will test its potential for language and digital humanities research and lay the grounds for automating the annotation and identification of SAs.
Title: "Synthesizing Speech using the AusTalk Corpus"
Paper presented at SST 2014. Zhijie Shao, Richard E. Leibbrandt and Trent W. Lewis
Abstract: Speech synthesis, also called text-to-speech (TTS), is the task of producing speech (an acoustic waveform) from text. It has been widely used in various domains. Although English in a variety of accents has been synthesized, a freely available Australian-accented English voice has not yet been well developed. Therefore, the project tries to synthesize an Australian-accented voice using the AusTalk corpus. The project adopted MARY (Modular Architecture for Research on speech sYnthesis), a speech synthesis system that includes useful auxiliary functions such as the Voice Import Tool and Emotion Markup Language support for creating new voices. The 59-sentence component from one speaker of the AusTalk corpus was selected as the database for building a voice. Preliminary, subjective evaluation by the researchers has suggested that in some cases the sound quality of the AusTalk-based voice was comparable to a larger, high-quality Blizzard-based voice. The accuracy of the automatic phoneme alignment can affect the quality of the voice; hand-aligned data will be used in the future, and other components of the protocol will be incorporated. A more systematic user evaluation is planned for the future.
Title: "Comparing acoustic analyses of Australian English vowels from Sydney: Cox (2006) versus AusTalk"
Paper presented at SST 2014. Jaydene Elvin & Paola Escudero
Abstract: This study presents a comparison of the acoustic properties of Australian English monophthongs produced by 60 monolingual females from Sydney’s Northern Beaches reported in Cox’s corpus and by the four monolingual females from Sydney recorded within the AusTalk corpus. Cross-corpus discriminant analyses are used to investigate the acoustic similarity between the two corpora to determine whether the values from these corpora would be appropriate for predicting L2 difficulty in future cross-linguistic studies using Western Sydney speakers. Preliminary findings suggest that there is little overall acoustic similarity across these two vowel corpora as classification scores from the discriminant analyses were consistently higher for the Cox corpus than AusTalk. In particular, greatest variation between the two corpora is observed in their productions of front vowels. Limitations for drawing conclusions based on the current data are provided and the need for an additional corpus of Australian English vowels from speakers in Western Sydney for future cross-linguistic studies is proposed.
Title: "Why the SQUARE vowel is the most variable in Sydney"
Paper presented at SST 2014. Nhung Nguyen, Jason A. Shaw
Abstract: Vowel variability is often explained in terms of linguistic and social factors. We have observed another factor that predicts vowel variability. Within four different corpora of Australian English vowels, we find a consistent relationship between the mean and standard deviation of formant values. For both F1 and F2, increases in mean formant values go hand in hand with increased variability. Given this observation, we propose that inferences about vowel variability take the mean formant values into account. Doing so changes conclusions about which vowels are most variable, undergoing change, or likely to reflect meaningful social variation.
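One simple way to take the mean into account when comparing variability is the coefficient of variation (standard deviation divided by mean). The sketch below is only an illustration of that idea; the abstract does not say which normalisation the authors use, and the vowel labels and formant values are invented for the example.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return stdev(values) / m


# Hypothetical F1 measurements in Hz for two made-up vowels (illustrative only).
f1_by_vowel = {
    "vowel_a": [300.0, 310.0, 290.0, 305.0, 295.0],
    "vowel_b": [800.0, 830.0, 770.0, 815.0, 785.0],
}

summary = {
    vowel: {
        "mean": mean(values),
        "sd": stdev(values),
        "cv": coefficient_of_variation(values),
    }
    for vowel, values in f1_by_vowel.items()
}

for vowel, stats in summary.items():
    print(vowel, stats)
```

In this made-up data the raw standard deviation of the second vowel is three times that of the first, yet the two coefficients of variation are close, which is the kind of shift in conclusions the abstract describes.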
Title: "Multimodal Speech Recognition with the AusTalk 3D Audio-Visual Corpus"
Tutorial at Interspeech 2014. Roberto Togneri, Mohammed Bennamoun, Chao Sui
Abstract: This tutorial will provide attendees with a brief overview of 3D-based audio-visual speech recognition (AVSR) research. Attendees will learn how to use the newly developed 3D audio-visual data corpus we derived from the AusTalk corpus for audio-visual speech/speaker recognition. In addition, we plan to introduce some results obtained with this corpus, which show a significant increase in speech recognition accuracy when depth-level and grey-level visual features are integrated. In the first part of the tutorial, we will review some recent works published in the last decade, so that attendees can obtain an overview of the fundamental concepts and challenges in this field. In the second part of the tutorial, we will briefly describe the recording protocol and contents of the 3D data corpus, and show attendees how to use this corpus for their own research. In the third part of this tutorial, we will present our results using the 3D data corpus. The experimental results show that, compared with conventional AVSR based on audio and grey-level visual features, the integration of grey and depth visual information can boost AVSR accuracy significantly. Moreover, we will also explain experimentally why adding depth information can benefit standard AVSR systems. Ultimately, we hope this tutorial will inspire more researchers in the community to contribute to this exciting research.
Title: "Australian English Dialect Perceptions"
Abstract: Australians often think they can identify where people come from by their accents, and they also often have stereotypes associated with particular ways of talking. This study explores language attitudes across Australia and makes use of the AusTalk database. Using the methodology of perceptual dialectology, non-expert native Australian English speakers are asked a series of questions through an online survey regarding social, regional, and cultural differences in Australian English. These respondents identify dialect differences through mental map tasks by identifying and describing differences on a national and a local map of Australia. They are then asked to rate regional and remoteness (urban/rural) differences across Australia through six scale activities, examining perceptions of ‘degree of difference’, ‘correctness’, ‘pleasantness’, ‘broadness’, and ‘speed of speech’. Finally, respondents locate and describe 16 voice samples from the AusTalk corpus. The voice samples include two young-adult speakers, one male and one female, from each state and territory. Respondents are asked to identify where each speaker is from, and to rate their accent on a number of qualitative scales. Preliminary findings show that native Australian English speakers can identify regional and social differences in Australian English across Australia. In particular, regions noted as distinct, locally known, or culturally stereotyped garner more frequent and stronger associations. Further data will explore the accuracy of voice identification, and the role voice identification plays in non-expert attitudes toward dialect perceptions.