Keywords:
automatic POS tagging, natural language processing (NLP), colloquial speech, tagging difficulties, research prospects
Copyright (c) 2024 Marta Garrote Salazar

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Abstract
Part-of-speech (POS) tagging is a fundamental natural language processing (NLP) task that assigns a grammatical label to each word in a text. POS tagging has been studied extensively on formal, well-structured language data, but tagging colloquial speech corpora accurately presents distinct challenges. This paper explores the difficulties that arise when POS tagging techniques are applied to colloquial speech texts. We discuss how dialects, slang, cultural and social context, and speech disfluencies affect tagging accuracy. After reviewing the state of the art, we identify potential solutions and future research directions for improving POS tagging performance on colloquial speech.