Task 4: Low-level NLP tools for the Yorùbá language

Yorùbá Language Task Description:

The task is to develop low-level NLP tools for Standard Yorùbá.

Standard Yorùbá is a language spoken by over 35 million of Nigeria's 200 million people, and in other countries such as Benin, Togo, Ghana, Côte d'Ivoire, Sudan and Sierra Leone. Outside Africa, a great number of speakers of the language are in Brazil, Cuba, Haiti, the Caribbean Islands, Trinidad and Tobago, the UK and America.

It is among the world's under-resourced languages, i.e. languages for which limited digital resources exist and whose computerization therefore poses unique challenges. These challenges include the non-availability of electronic lexica, standardized electronic corpora, and NLP tools such as a Part of Speech (POS) tagger, a Multilevel Segment Tokenizer and a Morphology Analyzer (MA).

For this task, we are providing medium-sized data sets for Yorùbá for developing a tokenizer, a morphological analyzer and an Automatic Language Identification system.

Multilevel Segment Tokenization is defined as the task of decomposing a stream of text into its unit segments (i.e. at phone, syllable and word level).
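To make the idea of multilevel segmentation concrete, the following is a minimal sketch covering only two of the levels: words, and letters with their tone marks and under-dots kept attached. It is not the organizers' tokenizer. The function names `split_words` and `split_graphemes` are hypothetical, letters are used here as a rough stand-in for phones, and the syllable level is left out because it needs language-specific rules.

```python
import re
import unicodedata

# Word characters plus combining marks (U+0300-U+036F), so that a tone mark
# stored as a separate code point does not split a word.
WORD_RE = re.compile(r"[\w\u0300-\u036f]+")


def split_words(text):
    """Word level: return the words of `text`, normalized to NFC."""
    return WORD_RE.findall(unicodedata.normalize("NFC", text))


def split_graphemes(word):
    """Letter level: return base letters with their diacritics attached."""
    units = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and units:
            units[-1] += ch  # attach the mark to the preceding base letter
        else:
            units.append(ch)
    return [unicodedata.normalize("NFC", unit) for unit in units]


if __name__ == "__main__":
    for word in split_words("Yorùbá dára."):
        print(word, split_graphemes(word))
```

Normalizing before splitting matters for Yorùbá text because the same letter can be stored either as one precomposed code point or as a base letter followed by combining marks.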

A Morphology Analyzer (MA) performs an important NLP task that supports quick and accurate analysis of text, for example for automatic translation. The task involves breaking words down into morphemes and grammatical constituents.

Automatic Language Identification (ALI) is a system that detects the language in which a document is written. ALI has a variety of applications, e.g. applying text processing techniques to real-world data, information storage and retrieval, and detecting the language of a document for machine translation.
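One common way to approach language identification is to compare character n-gram profiles. The sketch below is only an illustration under stated assumptions: the handful of training phrases, the language codes `yo` and `en`, and the names `char_ngrams`, `train`, `identify` and `PROFILES` are all hypothetical, and a real system would be trained on the task data once it is released.

```python
import math
import unicodedata
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of lower-cased, NFC-normalized text."""
    padded = " " + unicodedata.normalize("NFC", text.lower()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(samples):
    """Build one n-gram profile per language from {language: [texts]}."""
    return {
        lang: sum((char_ngrams(t) for t in texts), Counter())
        for lang, texts in samples.items()
    }


def identify(text, profiles):
    """Return the language whose profile is most similar to `text`."""
    grams = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(grams, profiles[lang]))


# Toy training phrases, for illustration only (not the task data).
PROFILES = train({
    "yo": ["Ẹ kú àárọ̀", "Báwo ni", "Orúkọ mi ni Adé", "Ẹ ṣé púpọ̀"],
    "en": ["Good morning", "How are you", "My name is Ade", "Thank you very much"],
})

if __name__ == "__main__":
    print(identify("Báwo ni o", PROFILES))
    print(identify("how are you doing", PROFILES))
```

With so little training text the profiles are very sparse, so this only shows the mechanics of the approach.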

Sub-tasks

The task has 3 sub-tasks:
4.A Tokenizer
4.B Morphology Analyzer
4.C Automatic Language Identification

Data

The link to the data set for the task will be made available soon.

Evaluation Procedure

Teams will be evaluated and ranked using the macro-averaged F1 score.
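For reference, macro-averaged F1 computes an F1 score for each label separately and then takes the unweighted mean, so rare labels count as much as frequent ones. The sketch below assumes one gold label and one predicted label per item and averages over every label seen in either list. The organizers' scorer may differ in such details, and the function name `macro_f1` and the example tags are hypothetical.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = []
    for label in set(gold) | set(pred):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    gold = ["N", "V", "N", "ADJ"]
    pred = ["N", "N", "N", "ADJ"]
    print(round(macro_f1(gold, pred), 3))  # per-label F1: N=0.8, V=0.0, ADJ=1.0
```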

Baseline

A simple probabilistic baseline, in which each token is assigned its most frequent tag, will be provided by the organizers.
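A most-frequent-tag baseline of this kind can be sketched as follows. This is one reading of the description, not the organizers' implementation: it assumes training data in the form of (token, tag) pairs grouped into sentences, and it falls back to the overall most frequent tag for tokens not seen in training. The names `train_baseline` and `predict` and the toy tags are hypothetical.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn each token's most frequent tag from non-empty training data."""
    per_token = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for token, tag in sentence:
            per_token[token][tag] += 1
            overall[tag] += 1
    # Fallback for unseen tokens: the most frequent tag in the training data.
    default_tag = overall.most_common(1)[0][0]
    lexicon = {tok: tags.most_common(1)[0][0] for tok, tags in per_token.items()}
    return lexicon, default_tag


def predict(tokens, lexicon, default_tag):
    """Tag each token with its most frequent training tag."""
    return [lexicon.get(token, default_tag) for token in tokens]


if __name__ == "__main__":
    # Toy data with made-up tokens and tags, for illustration only.
    training = [
        [("a", "X"), ("b", "Y")],
        [("a", "X"), ("a", "Y"), ("c", "X")],
    ]
    lexicon, default_tag = train_baseline(training)
    print(predict(["a", "b", "z"], lexicon, default_tag))
```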

Important Dates

The training data set will be made available by 22nd April, 2019. Other deadlines are as per the workshop schedule.

Results

Results will be made available as per the workshop schedule.

Paper submission

Paper submission instructions will be the same as for the workshop.

Task Organizers

If you have any queries regarding this task, please contact the task organizers:

Adeyanju Sosimi <asosimi@unilag.edu.ng>
Sunday O. Ojo <OjoSO@tut.ac.za>