Out of domain and cross-lingual part-of-speech tagging¶

The goal of this lab exercise is to build a part-of-speech tagger and to test it in the following two settings:

  • out-of-domain generalization: test the tagger on a different domain than the one used for training (but in the same language).
  • cross-lingual generalization: test the tagger on a different language than the one used for training.

To this end, you will rely on the aligned fastText embeddings: https://fasttext.cc/docs/en/aligned-vectors.html However, the original files are quite big, so I uploaded on the website a filtered version of them that contains only the words that appear in the data we use. Two important points:

  • you must not fine-tune these embeddings: they are fixed
  • you need a special embedding for unknown words (the ones that don't have an embedding in fastText), initialized and fixed to a vector of zeros (e.g. add an UNK word to your dictionary)

To take care of this, the best way is to build the dictionary when you read the fastText embedding file, and then, when you read the data, replace the words that don't have an embedding with the UNK word.

Warning: do not use any external library or tool to load the fastText embeddings. Do it yourself: it is just a few lines of Python.
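
Since you have to write this loader yourself, here is a minimal sketch, assuming the filtered files keep the standard .vec text format (an optional header line with the vocabulary size and dimension, then one word per line followed by its 300 vector components); the function and variable names are illustrative, not imposed:

```python
import torch

def load_embeddings(path, dim=300):
    """Read aligned fastText vectors from a .vec text file.

    Returns a word -> index dictionary and an embedding matrix where
    index 0 is reserved for the UNK word (a fixed all-zero vector).
    """
    word2idx = {"<unk>": 0}        # index 0 reserved for unknown words
    vectors = [torch.zeros(dim)]   # UNK embedding: all zeros, never trained
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) == 2:    # header line "n_words dim": skip it
                continue
            word, values = parts[0], parts[1:]
            word2idx[word] = len(vectors)
            vectors.append(torch.tensor([float(v) for v in values]))
    return word2idx, torch.stack(vectors)
```

Note that the loader already creates the UNK entry, so the same dictionary can be reused directly when reading the CoNLL-U data.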

Reading data¶

The data is in the conllu format: https://universaldependencies.org/format.html

Basically:

  • comments are lines starting with a #
  • a blank line separates sentences
  • the ID column can contain 3 types of values:
    • a single number
    • a "empty" token, these IDs contains a ".", for example 4.1 --- ignore these lines
    • multiwords, these IDs contains a "-", for example 4-5 --- ignore these lines
  • you must convert all words to lowercase (we only have embeddings for lowercased words)

You must start by writing a function that reads a conllu file and returns the list of sentences and the list of part-of-speech tags (i.e. keep only the "FORM" and "UPOS" columns).

Warning: do not use any library to read these files: it is just a few lines of Python.
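
For reference, here is a minimal sketch of such a reader that follows the format rules listed above (the function name is illustrative):

```python
def read_conllu(path):
    """Read a CoNLL-U file and return (sentences, tags).

    sentences is a list of lists of lowercased word forms,
    tags is the list of lists of corresponding UPOS tags.
    """
    sentences, tags = [], []
    words, pos = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):      # comment line
                continue
            if line == "":                # blank line: end of sentence
                if words:
                    sentences.append(words)
                    tags.append(pos)
                    words, pos = [], []
                continue
            cols = line.split("\t")
            token_id = cols[0]
            if "." in token_id or "-" in token_id:   # empty / multiword tokens
                continue
            words.append(cols[1].lower())  # FORM column, lowercased
            pos.append(cols[3])            # UPOS column
    if words:    # last sentence, in case the file has no trailing blank line
        sentences.append(words)
        tags.append(pos)
    return sentences, tags
```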

Dataset files:

  • in-domain English data: en_ewt-ud-*.conllu
  • out-of-domain English data: en_pud-ud-test.conllu
  • French test data: fr_gsd-ud-test.conllu

Neural network¶

You must build a very simple neural network:

  • word embeddings from fastText
  • a BiLSTM to construct context-sensitive representations of words
  • to predict the POS of each word, use a very simple and shallow MLP (even a simple linear projection is sufficient) at each position

Importantly, word embeddings will be different when you test in the cross-lingual setting. Therefore, I strongly recommend that you use two separate modules (see the sketch after this list):

  • one that retrieves word embeddings, which you instantiate twice (once with English embeddings, once with French embeddings) --- this also means that you need two dictionaries that map words to integers
  • one that does the rest of the computation
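
As an illustration of the first module, here is a minimal sketch of a frozen embedding lookup that you would instantiate twice, once per language; class and method names are illustrative, and it assumes the dictionary/matrix pair produced by the loader sketched above (UNK at index 0):

```python
import torch
import torch.nn as nn

class WordEmbedder(nn.Module):
    """Frozen lookup of pre-trained aligned fastText embeddings."""

    def __init__(self, word2idx, embedding_matrix):
        super().__init__()
        self.word2idx = word2idx
        # freeze=True: the pre-trained vectors are never fine-tuned
        self.embeddings = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)

    def encode(self, sentence):
        # map words to indices, falling back to the UNK index (0)
        return torch.tensor([self.word2idx.get(w, 0) for w in sentence])

    def forward(self, word_ids):
        return self.embeddings(word_ids)
```

The second module (BiLSTM + output projection) takes these embeddings as input and is therefore identical in both settings.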

In order to correctly batch your data during training, you will need to use pack_padded_sequence and pad_packed_sequence (check the lecture slides). Explain in the report why you need them in this case (and why you didn't need them in the language model lab exercise) and what they do.
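
As a rough illustration of where these two calls fit in the second module (names, dimensions and hyper-parameters are illustrative, not a prescribed implementation):

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class BiLSTMTagger(nn.Module):
    """BiLSTM over fixed word embeddings + a linear projection per position."""

    def __init__(self, emb_dim, hidden_dim, n_tags):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.output = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, embedded, lengths):
        # embedded: (batch, max_len, emb_dim), a padded batch of word embeddings
        # lengths:  the true length of each sentence in the batch
        packed = pack_padded_sequence(embedded, lengths,
                                      batch_first=True, enforce_sorted=False)
        packed_out, _ = self.bilstm(packed)
        out, _ = pad_packed_sequence(packed_out, batch_first=True)
        return self.output(out)    # (batch, max_len, n_tags) tag scores
```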

Evaluation¶

Report the tagging accuracy, i.e. the proportion of correctly predicted tags.

If you have time, you can also explore more fine-grained metrics, especially in the cross-lingual case: accuracy per tag type, recall/precision/F1 per tag type, ...
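
One possible way to compute these numbers, assuming gold and predicted tags are given as lists of lists of strings (the helper name and signature are illustrative):

```python
from collections import Counter

def evaluate(gold_tags, predicted_tags):
    """Return overall tagging accuracy and per-tag accuracy."""
    correct, total = 0, 0
    per_tag_correct, per_tag_total = Counter(), Counter()
    for gold_sent, pred_sent in zip(gold_tags, predicted_tags):
        for gold, pred in zip(gold_sent, pred_sent):
            total += 1
            per_tag_total[gold] += 1
            if gold == pred:
                correct += 1
                per_tag_correct[gold] += 1
    per_tag = {tag: per_tag_correct[tag] / per_tag_total[tag]
               for tag in per_tag_total}
    return correct / total, per_tag
```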
