The task is to develop low-level NLP tools for Magahi, an Eastern Indo-Aryan language spoken largely in the eastern Indian states of Bihar, Jharkhand and Uttar Pradesh. Magahi is part of what is considered a dialect continuum running from the eastern part of India to its western part and consisting of approximately 50 languages and varieties. Hindi, an official language of India, is part of the same continuum, so the two languages are closely related. Despite this similarity, Magahi diverges substantially from Hindi in both lexicon and morphology. As a result, most tools developed for Hindi do not perform well on the other languages of the continuum. For this task, we are providing small annotated data sets for Magahi for developing a part-of-speech tagger and a morphological analyser. The data is annotated with part-of-speech categories and morphological features from the Universal Dependencies tag set.
The task has two sub-tasks:
9.A POS tagger for Magahi
9.B Analyzer for number, gender, person, tense, aspect, honorificity and case relation
We will provide 5,000 annotated sentences in CoNLL-U format. In addition, participants are encouraged to use the Hindi data set available from the Universal Dependencies project. They are also free to use any other data set, as long as it is freely available for research.
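For participants new to the format, the following is a minimal sketch of reading CoNLL-U text and extracting the word form, the universal POS tag and the morphological features for each token. The function names `parse_feats` and `read_conllu` are illustrative, and the tokens `tok1` and `tok2` in the sample are made-up placeholders, not real Magahi data. The sketch assumes the standard ten tab-separated CoNLL-U columns, with FORM in column 2, UPOS in column 4 and FEATS in column 6.

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment lines carry no token information.
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append({
            "form": cols[1],
            "upos": cols[3],
            "feats": parse_feats(cols[5]),
        })
    if current:
        sentences.append(current)
    return sentences


# Placeholder sample: 'tok1' and 'tok2' are invented tokens for illustration.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\ttok1\t_\tNOUN\t_\tCase=Nom|Number=Sing\t_\t_\t_\t_\n"
    "2\ttok2\t_\tVERB\t_\tAspect=Perf|Tense=Past\t_\t_\t_\t_\n"
    "\n"
)

sentences = read_conllu(SAMPLE)
```

Here `sentences[0][0]["feats"]` is a dictionary mapping feature names to values, which is a convenient shape for scoring each morphological category separately.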
Teams will be evaluated and ranked by macro-averaged F1 score.
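As an illustration of the metric, the sketch below computes macro-averaged F1 over token-level tags: F1 is computed per label and the per-label scores are averaged with equal weight, so rare tags count as much as frequent ones. This is not the official scorer. The function name `macro_f1` is hypothetical, and averaging over the union of gold and predicted labels is an assumption, since the exact label set used for ranking is not specified here.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Macro-averaged F1 over two equal-length sequences of labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed

    # Assumption: average over every label seen in gold or pred.
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


gold = ["NOUN", "VERB", "NOUN", "ADP"]
pred = ["NOUN", "NOUN", "NOUN", "ADP"]
score = macro_f1(gold, pred)
```

In this example the per-label F1 values are 0.8 for NOUN, 0.0 for VERB and 1.0 for ADP, giving a macro average of 0.6, even though three of the four tokens are tagged correctly.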
The organizers will provide a simple probabilistic baseline that assigns each token its most frequent tag.
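One common reading of such a baseline is sketched below: each word form receives the tag it carries most often in the training data, and unseen forms fall back to the overall most frequent tag. The organizers' actual baseline may differ. The names `train_baseline` and `tag_tokens`, the fallback rule and the placeholder tokens `w1` to `w3` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn the most frequent tag per word form, plus a global fallback tag."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for form, tag in sentence:
            per_word[form][tag] += 1
            overall[tag] += 1
    lexicon = {form: counts.most_common(1)[0][0]
               for form, counts in per_word.items()}
    # Assumption: unseen words get the overall most frequent tag.
    default = overall.most_common(1)[0][0] if overall else None
    return lexicon, default


def tag_tokens(tokens, lexicon, default):
    """Tag each token with its learned tag, or the fallback if unseen."""
    return [lexicon.get(token, default) for token in tokens]


# Placeholder training data: 'w1'..'w3' are invented tokens.
train = [
    [("w1", "NOUN"), ("w2", "VERB")],
    [("w1", "NOUN"), ("w3", "NOUN")],
    [("w2", "VERB"), ("w2", "NOUN")],
]
lexicon, default = train_baseline(train)
predicted = tag_tokens(["w1", "w2", "unseen"], lexicon, default)
```

The same scheme extends to the morphological sub-task by storing, for each word form, its most frequent value for each feature.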
The training data set will be made available by 30 April 2019. Other deadlines are as per the workshop schedule.
Results will be made available as per the workshop schedule.
Paper submission instructions will be the same as for the workshop.
If you have any queries regarding this task, please contact the task organizers: