Task 7: Named Entity Recognition (NER) for Farsi

Task Description

Named Entity Recognition (NER) is defined as the task of identifying relevant nouns such as persons, organizations, locations, products, genes, etc., that are mentioned in the text.

NER is an important task employed as a primary step in other tasks such as event detection from news, customer support for online shops, knowledge graph construction, and content recommendation.

Formally, Named Entities are referenced in a text by their name, indicated by a proper noun, noun phrase, pronoun, etc. On this basis, two sub-tasks are defined. The input is raw text, and participating systems should produce annotated text that highlights the names of entities.

Sub tasks

Task 7.A. Farsi 3 Classes NER

The first sub task targets 3 basic classes: person, location, and organization.

Task 7.B. Farsi 7 Classes NER

The second sub task targets 7 standard classes: person, location, organization, money, percent, date, and time.

Similar Tasks: Entity Linking.

Data

NER methods usually combine rule-based and machine learning approaches. Therefore, an NER system needs two types of data:

1) a corpus tagged with the named entities, and

2) gazetteers of named entities.

PEYMA is a medium-sized Farsi corpus annotated with 7 classes of named entities. It contains more than 700 news documents and was prepared by the Iran Telecommunication Research Center (ITRC).

A copy of the corpus can be downloaded from http://www.parsigan.ir/projects/NER (an account is required to get the corpus).

Participating teams are allowed to use any other public resource, such as Farsi Wikipedia, provided that they describe every such resource completely in their paper. A gazetteer of named entities, including proper nouns and noun phrases indicating names of organizations and places, is released as well.

The accuracy of the available Farsi NER systems may depend on the completeness of their gazetteers. Since the aim of the task is to evaluate the accuracy of the methods rather than the size of gazetteers, a gazetteer is prepared and released as part of the data required for this task. Participating teams are not allowed to use gazetteers beyond that unless the gazetteer is public or the team agrees to release it.

PEYMA is released as the training data. Test data is not yet available; since it is much smaller than the training data, it will be developed and released according to the timeline of the shared task. The gazetteer will also be prepared according to the timeline.

Training data format:

The training data files contain two columns separated by a tab. Each word is on a separate line, and there is an empty line after each sentence. The first item on each line is a word, and the second is its named entity tag. The named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. The first word of each named entity has the tag B-TYPE to show that it starts a new named entity. A word with tag O is not part of a named entity. Here is an example:

به	O
گزارش	O
فردوس	B-ORG
برين	I-ORG
به	O
نقل	O
از	O
گروه	O
ديگر	O
رسانه‌هاي	O
خبرگزاري	B-ORG
فارس	I-ORG
،	O
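
The format described above can be read with a few lines of code. The following is a minimal sketch, not an official loader: the function name `read_sentences` and the `Sentence` type are hypothetical, and it assumes every non-empty line holds exactly one word and one tag separated by a tab.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (word, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Parse two-column, tab-separated lines into sentences.

    Assumption: an empty (or whitespace-only) line ends a sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Sentence boundary: store the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last tab so the tag is always the final column.
        word, tag = line.rsplit("\t", 1)
        current.append((word, tag))
    if current:
        # The last sentence may not be followed by an empty line.
        sentences.append(current)
    return sentences
```

To load a file, pass an open file handle, for example `read_sentences(open(path, encoding="utf-8"))`, where `path` points to a training file; UTF-8 encoding is an assumption here.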

Input Format: Similar to the training data, each word is on a separate line, with an empty line between sentences.
Output Format: Similar to the training data.
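
As an illustration of the output format, the sketch below serializes tagged sentences back into the two-column layout. The function name `format_sentences` is hypothetical, and the choice to end every sentence (including the last) with an empty line is an assumption based on the training data description.

```python
from typing import Iterable, List, Tuple


def format_sentences(sentences: Iterable[List[Tuple[str, str]]]) -> str:
    """Render (word, tag) sentences as tab-separated lines.

    Each word goes on its own line, and an empty line follows each sentence.
    """
    lines: List[str] = []
    for sentence in sentences:
        for word, tag in sentence:
            lines.append(f"{word}\t{tag}")
        lines.append("")  # empty line after each sentence
    return "\n".join(lines) + "\n" if lines else ""
```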
Receiving data:
In case of any problem in receiving data, please contact nsr.taghizadeh@ut.ac.ir

Evaluation Procedure

The standard evaluation metrics are precision, recall, and F1. Precision is defined as the fraction of predicted entity name spans that line up exactly with spans in the gold data. Recall is defined similarly as the fraction of names in the gold data that appear at exactly the same location in the predictions. F1 is the harmonic mean of precision and recall.
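
The exact-match scoring described above can be sketched as follows. This is not the official scorer: the names `extract_spans` and `score` are hypothetical, counts are pooled over all sentences and entity types (micro-averaging, an assumption), and an I-TYPE tag that does not continue a span of the same type is leniently treated as starting a new one.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a B-TYPE / I-TYPE / O tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # The trailing "O" is a sentinel that closes a span ending the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def score(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching spans."""
    n_correct = n_pred = n_gold = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        n_correct += len(g & p)  # same boundaries and same type
        n_pred += len(p)
        n_gold += len(g)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1
```

A prediction that finds only part of an entity, or assigns it the wrong type, counts as both a precision error and a recall error under this scheme.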

Baseline

The simple baseline is a string matching algorithm which only looks for exact matches of list entries in the text.
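
A string-matching baseline of this kind might look like the sketch below. It is an illustration only, not the organizers' implementation: the function name `gazetteer_tag` is hypothetical, the list is assumed to be a gazetteer that maps tokenized names to entity types, and preferring the longest match at each position is an added assumption.

```python
from typing import Dict, List, Tuple


def gazetteer_tag(words: List[str], gazetteer: Dict[Tuple[str, ...], str]) -> List[str]:
    """Tag words by exact lookup of word sequences in a gazetteer.

    `gazetteer` maps a tuple of tokens to an entity type such as "ORG".
    """
    tags = ["O"] * len(words)
    max_len = max((len(entry) for entry in gazetteer), default=0)
    i = 0
    while i < len(words):
        matched = 0
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            etype = gazetteer.get(tuple(words[i:i + n]))
            if etype is not None:
                tags[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + etype
                matched = n
                break
        # Skip past a matched entity, or advance by one word otherwise.
        i += matched if matched else 1
    return tags
```

Such a baseline cannot recognize names missing from the list and cannot disambiguate entries that belong to more than one class.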

Important Dates: http://nsurl.org/importantdates/

Task participation:

To participate in this task, the team leader has to do the following:

  1. Choose a name for your team (the name should reflect your team).
  2. Log in as an author at https://easychair.org/conferences/?conf=nsurl2019
  3. Add the paper title: NSURL-2019 Task 7: Named Entity Recognition for Farsi
  4. Add the paper authors: the team members
  5. Add the paper abstract and keywords: a simple tentative abstract that you can modify at any time
  6. Submit.

Results:

The results of the participating teams will be listed here after 30 June 2019.

Paper submission:

Instructions for paper submission will be listed here after 30 June 2019.

Task Organizers:

If you have any queries regarding this task, please contact the task organizers:

Dr. Heshaam Faili <hfaili@ut.ac.ir>

Nasrin Taghizadeh <nsr.taghizadeh@ut.ac.ir>