Leveraging Machine Translation and Text-to-Speech Technologies for Disseminating Educational Content in Native Nigerian Languages

PROJECT OVERVIEW

According to recent UNICEF data [Unicef, 2022], millions of adults and children in Nigeria are out of school, which forces many of them to take up farming, commerce, and other vocations without any formal education. They are unable to learn and use cutting-edge technology in their occupations because they cannot afford to attend an institution where they would be introduced to such methods. Moreover, most educational materials are in English, which further restricts access to knowledge.

 

Despite the best efforts of government agencies and NGOs to get such people back into the classroom, we believe these efforts will take a long time, especially for adults, if educational materials are not available in the native languages in which they are fluent. Hence, we propose to provide instructional materials to such people in their native languages with the aid of speech and machine translation technologies. However, to use these technologies to their full potential, we must produce language corpora (such as parallel data for machine translation) that are specifically relevant to the domains of interest. Consequently, in this project, we propose the development of the NiLE (Nigerian Language Processing for Educational Content) dataset.

 

The aim of this project is to curate parallel corpora for the three main Nigerian languages (Hausa, Igbo, and Yorùbá) covering two core disciplines, agricultural science and civic education, with 8,000 parallel sentences per discipline for a total of 16,000 parallel sentences per language. Hausa, Igbo, and Yorùbá are not only the three most spoken languages in Nigeria but also among the ten most spoken languages in Africa, with over 30 million native speakers each. We chose these two disciplines because: (1) they are important to the UN's Sustainable Development Goals, (2) it is necessary to teach citizens their civic rights, duties, and responsibilities, and (3) it is essential to ensure that citizens, particularly those in rural areas, are aware of the most recent agricultural methods.

 

Additionally, we aim to record 10 hours of speech in order to adapt recordings of educational materials produced by native speakers of English (e.g. in the USA, Canada, or the UK) to a Nigerian-English accent. This should fast-track the learning of educational materials not produced in Nigeria (e.g. agricultural podcasts). Similarly, we propose to adapt existing English text-to-speech models to Nigerian accents for better comprehension of read-out speech. At the conclusion of this project, we intend to deploy the translation and speech models created from the curated data.

At the end of this project, we hope to have curated a novel parallel dataset covering two disciplines for the three major Nigerian languages, developed machine translation and text-to-speech systems that allow Nigerian students, particularly children, to access educational content in their native languages, and released the curated data publicly to support language research on Nigerian languages.

THEMATIC AREAS

LOCATION: Nigeria

FUNDING: $46,770

PRINCIPAL INVESTIGATOR: Mr. Kolawole Olatubosun