DIALECT-DRIVEN ASR ERRORS: PHONETIC MISMATCH IN SOUTH ASIAN AMERICAN ENGLISH SPEECH

DOI:

https://doi.org/10.63878/jalt2124

Keywords:

Asian American English, ASR bias, phonetic variation, speech recognition errors, dialect mismatch, word error rate, linguistic equity, corpus linguistics, computational linguistics.

Abstract

Automatic speech recognition (ASR) systems now approach human-level accuracy on Mainstream American English (MAE) but continue to make systematic errors on non-mainstream varieties. This paper examines how ASR errors arise in South Asian American English (SAAE) and argues that they stem from a systematic discrepancy between the phonetic realizations of SAAE speakers and the acoustic-phonetic distributions encoded in MAE-trained models, a claim termed the Phonetic Mismatch Hypothesis. A convergent mixed-methods design combined controlled speech elicitation with quantitative error analysis. A corpus of 40 SAAE speech samples was assembled to reflect major segmental and suprasegmental features, including variation in vowel quality, consonant cluster reduction, epenthesis, and prosodic transfer. A pretrained Whisper ASR model was evaluated against reference transcriptions, and Word Error Rate (WER) was calculated. A total of 170 errors were identified and classified as substitutions (82; 48.2%), deletions (52; 30.6%), and insertions (36; 21.2%). SAAE speech yielded a WER of approximately 43%, compared with approximately 6% for MAE speech; a fine-tuned adaptation condition partially closed the gap (WER ≈ 18%). Error types were not randomly distributed across phonetic features: substitutions were driven by vowel shifts and consonant replacements, deletions by consonant cluster reduction, and most insertions by prosodic and rhythmic variation, specifically syllable-timed rhythm and epenthesis. These findings support the Phonetic Mismatch Hypothesis, which attributes ASR errors to systematic dialectal variation rather than to random system failure. The study contributes a phonologically grounded account of ASR bias and proposes that training and evaluation pipelines incorporate dialect-specific phonetic knowledge.
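The analysis above rests on computing WER and classifying each error as a substitution, deletion, or insertion. The paper does not publish its scoring code; the sketch below is an illustrative re-implementation of the standard Levenshtein-based scoring procedure in plain Python (an assumption for exposition, not the authors' actual pipeline; the function name `wer_breakdown` is hypothetical).

```python
def wer_breakdown(ref: str, hyp: str) -> dict:
    """Align a reference and an ASR hypothesis with Levenshtein distance,
    then backtrace to count substitutions, deletions, and insertions."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming table of word-level edit distances.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    # Backtrace one optimal alignment and tally each error type.
    subs = dels = ins = 0
    i, j = len(r), len(h)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and r[i - 1] == h[j - 1]:
            i, j = i - 1, j - 1                     # correct word
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1          # substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1; i -= 1                       # deletion
        else:
            ins += 1; j -= 1                        # insertion
    return {"wer": (subs + dels + ins) / len(r),
            "sub": subs, "del": dels, "ins": ins}
```

For example, scoring the hypothesis "the cat sad on mat" against the reference "the cat sat on the mat" yields one substitution, one deletion, and a WER of 2/6. Note that the S/D/I decomposition is a property of one optimal alignment; other valid alignments with the same total distance can distribute the error types differently.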

References

Afkir, M., & Zellou, G. (2026). Phonological complexity, speech style, and individual differences influence ASR performance for Tarifit. Scientific Reports.

Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., et al. (2016). Deep speech 2: End-to-end speech recognition in English and Mandarin. Proceedings of the 33rd International Conference on Machine Learning, 173–182.

Chan, M. P. Y., Choe, J., Li, A., Chen, Y., Gao, X., & Holliday, N. R. (2022, September). Training and typological bias in ASR performance for world Englishes. In Interspeech (pp. 1273–1277).

Errattahi, R., El Hannani, A., & Ouahmane, H. (2018). Automatic speech recognition errors detection and correction: A review. Procedia Computer Science, 128, 32–37.

Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 233–277). York Press.

Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., & Ng, A. Y. (2014). Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.

Hassan, M. A., Rehmat, A., Khan, M. U. G., & Yousaf, M. H. (2022). Improvement in automatic speech recognition of South Asian accent using transfer learning of DeepSpeech2. Mathematical Problems in Engineering, 2022, Article 6825555. https://doi.org/10.1155/2022/6825555

Jahan, M., Mazumdar, P., Thebaud, T., Hasegawa-Johnson, M., Villalba, J., Dehak, N., & Moro-Velazquez, L. (2025, April). Unveiling performance bias in ASR systems: A study on gender, age, accent, and more. In ICASSP 2025 – 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5). IEEE.

Just, S. A., Elvevåg, B., Pandey, S., Nenchev, I., Bröcker, A. L., Montag, C., & Morgan, S. E. (2025). Moving beyond word error rate to evaluate automatic speech recognition in clinical samples: Lessons from research into schizophrenia-spectrum disorders. Psychiatry Research, 116690.

Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D., & Goel, S. (2020). Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14), 7684–7689. https://doi.org/10.1073/pnas.1915768117

Labov, W. (1994). Principles of linguistic change: Internal factors. Blackwell.

Lai, L. F., & Holliday, N. R. (2024). Voice quality variation in AAE: An additional challenge for addressing bias in ASR models? In Interspeech.

Li, C., Cohen, T., & Pakhomov, S. (2024). Reexamining racial disparities in automatic speech recognition performance: The role of confounding by provenance. arXiv preprint arXiv:2407.13982.

Lippi-Green, R. (2012). English with an accent: Language, ideology, and discrimination in the United States (2nd ed.). Routledge.

Martin, J. L., & Tang, K. (2020, October). Understanding racial disparities in automatic speech recognition: The case of habitual "be". In Interspeech (pp. 626–630).

McKenzie, R. M. (2010). The social psychology of English as a global language. Springer.

Munro, M. J., & Derwing, T. M. (1995). Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 45(1), 73–97. https://doi.org/10.1111/j.1467-1770.1995.tb00963.x

Mulholland, M., Lopez, M., Evanini, K., Loukina, A., & Qian, Y. (2016, March). A comparison of ASR and human errors for transcription of non-native spontaneous speech. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5855–5859). IEEE.

Ngueajio, M. K., & Washington, G. (2022, June). Hey ASR system! Why aren’t you more inclusive? Automatic speech recognition systems’ bias and proposed bias mitigation techniques. A literature review. In International Conference on Human-Computer Interaction (pp. 421–440). Cham: Springer Nature Switzerland.

Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audiobooks. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210. https://doi.org/10.1109/ICASSP.2015.7178964

Pasandi, H. B., & Pasandi, H. B. (2022, November). Evaluation of ASR systems for conversational speech: A linguistic perspective. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems (pp. 962–965).

Ravuri, S., & Stolcke, A. (2015). Recurrent neural network and LSTM models for lexical utterance classification. Proceedings of Interspeech, 135–139.

Russell, S. O. C., Gessinger, I., Krason, A., Vigliocco, G., & Harte, N. (2024). What automatic speech recognition can and cannot do for conversational speech transcription. Research Methods in Applied Linguistics, 3(3), 100163.

Sumner, M., Kim, S. K., King, E., & McGowan, K. B. (2014). The socially weighted encoding of spoken words: A dual-route approach to speech perception. Frontiers in Psychology, 4, Article 1015. https://doi.org/10.3389/fpsyg.2013.01015

Tobin, J., Nelson, P., MacDonald, B., Heywood, R., Cave, R., Seaver, K., ... & Green, J. R. (2024). Automatic speech recognition of conversational speech in individuals with disordered speech. Journal of Speech, Language, and Hearing Research, 67(11), 4176–4185.

Trudgill, P. (2000). Sociolinguistics: An introduction to language and society (4th ed.). Penguin.

Weiss, K., Khoshgoftaar, T. M., & Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(1), 1–40. https://doi.org/10.1186/s40537-016-0043-6

Wassink, A. B., Gansen, C., & Bartholomew, I. (2022). Uneven success: Automatic speech recognition and ethnicity-related dialects. Speech Communication, 140, 50–70.

Zhang, Y., Park, D. S., Han, W., Chiu, C. C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2022). BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. IEEE Journal of Selected Topics in Signal Processing, 14(4), 732–745. https://arxiv.org/pdf/2109.13226

Published

2026-04-29