Academy of Finland
Funding decision
Applicant / Contact person Laippala, Veronika
Organisation University of Turku
Project title
Massively multilingual modeling of registers in web-scale corpora
Decision No. 331297
Decision date 28.05.2020
Funding period 01.09.2020 - 31.08.2024
Funding 480 000
Project description
This project combines the long tradition of corpus linguistics with recent advances in natural language processing (NLP) to explore web registers (situationally defined Internet text varieties such as news, blogs or how-to pages) on a massively multilingual scale. Specifically, the project 1) analyzes language-specific differences in registers and creates a data-driven description of the full range of web registers in six languages, 2) develops machine learning methods for the large-scale modeling of registers and their identification in massively cross-lingual and multilingual settings, and 3) automatically identifies registers in Universal Parsebanks, a language resource spanning 100 billion words and 64 languages. The project thereby provides critical knowledge about online communication, together with methods for turning web data from masses of raw, unstructured text into organized resources with rich contextual information.