CODE-SWITCHING IN MULTILINGUAL DIGITAL TEXTS: AUTOMATED DETECTION AND LINGUISTIC PATTERNING THROUGH AI-BASED CORPUS ANALYSIS
DOI: https://doi.org/10.63878/jalt1197

Keywords: code-switching, multilingual NLP, corpus analysis, transformer models, syntactic boundaries, bilingual collocations, annotation reliability, AI-based language detection

Abstract
This study examines whether machine-learning-based corpus analysis can reliably detect and characterise code-switching in multilingual digital texts, and it tracks switching patterns across five language pairs (English-Hindi, English-Spanish, English-Arabic, English-Tagalog, and English-Malay). Using a large-scale annotated corpus and a transformer-based architecture fine-tuned for the multilingual setting, the study achieved token-level accuracies above 95 percent and macro F1 scores above 0.94 in both in-domain and out-of-domain evaluation. The analysis identified consistent part-of-speech triggers, with nouns, verbs, and discourse markers occurring most frequently at switch points, and showed that switches cluster at noun phrase-verb phrase boundaries. High rankings for bilingual collocations further indicated that formulaic expressions are robust predictors of switching. Annotation reliability was confirmed by Cohen's kappa values exceeding 0.86, and hyperparameter tuning showed that longer sequence lengths are required to capture long-distance switching dependencies. The results not only confirm that AI-based models can approach human-level code-switch detection but also contribute to theoretical understanding of the structural, pragmatic, and sociolinguistic dimensions of cross-linguistic contact in online communication.
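The abstract describes token-level code-switch detection with a fine-tuned multilingual transformer but does not name the exact architecture or label set. The sketch below is a minimal, illustrative setup using Hugging Face's token-classification API with an assumed XLM-RoBERTa backbone and a hypothetical per-language tag set; the fine-tuned weights from the annotated corpus would replace the base checkpoint in practice.

```python
# Minimal sketch of token-level code-switch detection.
# Assumptions (not from the paper): XLM-RoBERTa backbone, the LABELS tag set below.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-base"                                  # assumed backbone
LABELS = ["ENG", "HIN", "SPA", "ARA", "TGL", "MSA", "OTHER"]     # assumed tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)  # in practice, load weights fine-tuned on the annotated corpus

def tag_languages(sentence: str) -> list[tuple[str, str]]:
    """Assign a language label to every word-level token in a sentence."""
    words = sentence.split()
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits            # shape: (1, seq_len, num_labels)
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    # Map subword predictions back to whole words (first subword wins).
    tags, seen = [], set()
    for idx, word_id in enumerate(enc.word_ids(batch_index=0)):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            tags.append((words[word_id], LABELS[pred_ids[idx]]))
    return tags

print(tag_languages("I will call you kal subah after the meeting"))
```

Switch points then fall wherever the predicted label changes between adjacent words, which is how token-level predictions feed the boundary and trigger analyses described above.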
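The two quantitative checks reported in the abstract, macro F1 for detection quality and Cohen's kappa for inter-annotator agreement, can be reproduced with standard scikit-learn metrics. The tiny label arrays below are placeholder data for illustration only, not figures from the study.

```python
# Sketch of the reported evaluation measures, using placeholder label arrays.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Gold labels vs. model predictions for a handful of tokens (assumed data).
gold  = ["ENG", "ENG", "HIN", "HIN", "ENG", "SPA", "ENG"]
preds = ["ENG", "ENG", "HIN", "ENG", "ENG", "SPA", "ENG"]

print("token accuracy:", accuracy_score(gold, preds))
print("macro F1:      ", f1_score(gold, preds, average="macro"))

# Two independent annotators labelling the same tokens (assumed data);
# the study reports kappa values above 0.86 for its annotation scheme.
annotator_a = ["ENG", "HIN", "HIN", "ENG", "SPA", "ENG"]
annotator_b = ["ENG", "HIN", "ENG", "ENG", "SPA", "ENG"]
print("Cohen's kappa: ", cohen_kappa_score(annotator_a, annotator_b))
```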
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.