The goal of this lab exercise is to build a part-of-speech tagger that you will test in the following two settings:
To this end, you will rely on the aligned fastText embeddings: https://fasttext.cc/docs/en/aligned-vectors.html However, the original files are quite big, so I uploaded to the website a filtered version of them that contains only the words that appear in the data we use. Two important points:
To take care of this, the best way is to build the dictionary when you read the fastText embedding file, and then, when you read the data, replace words that don't have an embedding with the UNK word.
Warning: do not use any external library or tool to load the fastText embeddings. Do it yourself; it is just a few lines of Python.
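As a starting point, here is a minimal sketch of such a loader. It assumes the standard fastText `.vec` text format (a header line `<n_words> <dim>`, then one word and its vector per line); the reserved `<unk>` token and its zero vector are my own convention, not part of the file format:

```python
import numpy as np

def load_fasttext_embeddings(path):
    """Read a fastText .vec file and return (word2idx, embedding matrix).

    File format: first line is "<n_words> <dim>"; each following line is
    "<word> <v1> ... <v_dim>" separated by spaces.
    """
    words, vectors = [], []
    with open(path, encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vectors.append([float(x) for x in parts[1:]])
    # Reserve index 0 for the UNK word (zero vector -- an arbitrary choice).
    word2idx = {"<unk>": 0}
    for w in words:
        word2idx.setdefault(w, len(word2idx))
    embeddings = np.zeros((len(word2idx), dim), dtype=np.float32)
    for w, v in zip(words, vectors):
        embeddings[word2idx[w]] = v
    return word2idx, embeddings
```

The returned `word2idx` dictionary is exactly the one you should use later to map corpus words to indices (falling back to `<unk>`).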
The data is in the conllu format: https://universaldependencies.org/format.html
Basically:
You must start by writing a function that reads a conllu file and returns the list of sentences and the list of part-of-speech tags (i.e. keep only the "form" and "UPOS" columns).
Warning: do not use any library to read these files; it is just a few lines of Python.
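A possible sketch of this reader, following the CoNLL-U conventions from the link above (tab-separated columns, `#` comment lines, blank line between sentences, multiword-token and empty-node lines to skip):

```python
def read_conllu(path):
    """Read a CoNLL-U file; return (sentences, tags), where each sentence
    is a list of word forms and each tags entry is the matching UPOS list."""
    sentences, tags = [], []
    words, pos = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):      # comment line
                continue
            if not line:                  # blank line = end of sentence
                if words:
                    sentences.append(words)
                    tags.append(pos)
                    words, pos = [], []
                continue
            cols = line.split("\t")
            # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
            if "-" in cols[0] or "." in cols[0]:
                continue
            words.append(cols[1])         # FORM column
            pos.append(cols[3])           # UPOS column
    if words:                             # file may not end with a blank line
        sentences.append(words)
        tags.append(pos)
    return sentences, tags
```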
Dataset files:
You must build a very simple neural network:
Importantly, word embeddings will be different when you test in the cross-lingual setting. Therefore, I strongly recommend that you use two separate modules:
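One way to realize this split is sketched below: a language-specific embedding module that can be swapped at test time, and a language-independent tagging module that only ever sees embedding vectors. The class names, and the choice of a BiLSTM encoder, are assumptions for illustration, not requirements:

```python
import torch
import torch.nn as nn

class WordEmbedder(nn.Module):
    """Language-specific part: lookup of the (frozen) aligned fastText
    embeddings. Swap this module to change the test language."""
    def __init__(self, embeddings):
        # embeddings: float tensor of shape (vocab_size, emb_dim)
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(embeddings, freeze=True)

    def forward(self, word_ids):          # (batch, seq_len)
        return self.emb(word_ids)         # (batch, seq_len, emb_dim)

class Tagger(nn.Module):
    """Language-independent part: a sequence encoder over embeddings plus a
    linear projection to per-token tag scores (BiLSTM is an assumption)."""
    def __init__(self, emb_dim, hidden_dim, n_tags):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, embedded):          # (batch, seq_len, emb_dim)
        out, _ = self.lstm(embedded)
        return self.proj(out)             # (batch, seq_len, n_tags)
```

Because the `Tagger` never touches word indices, testing cross-lingually amounts to pairing the trained `Tagger` with a `WordEmbedder` built from the other language's aligned embeddings.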
In order to correctly batch your data during training, you will need to use pack_padded_sequence and pad_packed_sequence (check the lecture slides). Explain in the report why you need them in this case (and why you didn't need them in the language model lab exercise) and what they do.
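A minimal sketch of what packing does, on a toy batch (the shapes and LSTM sizes here are arbitrary):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sentences of lengths 3 and 2, padded to max length 3.
batch = torch.randn(2, 3, 4)          # (batch, max_len, emb_dim)
lengths = torch.tensor([3, 2])

lstm = torch.nn.LSTM(input_size=4, hidden_size=5, batch_first=True)

# Packing tells the LSTM the true lengths, so it never runs over padding.
packed = pack_padded_sequence(batch, lengths,
                              batch_first=True, enforce_sorted=False)
packed_out, _ = lstm(packed)

# Unpacking restores a rectangular tensor; positions beyond each
# sentence's true length are filled with zeros.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
```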
Returns the tagging accuracy, i.e. the proportion of correctly predicted tags.
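For concreteness, a sketch of this evaluation function, assuming gold and predicted tags are given as lists of per-sentence tag lists:

```python
def tagging_accuracy(gold, pred):
    """Proportion of correctly predicted tags over all tokens.

    gold, pred: lists of tag lists, one inner list per sentence."""
    correct = total = 0
    for g_sent, p_sent in zip(gold, pred):
        for g, p in zip(g_sent, p_sent):
            correct += (g == p)
            total += 1
    return correct / total
```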
If you have time, you can also explore more fine-grained metrics, especially in the cross-lingual case: accuracy per tag type, recall/precision/F1 per tag type, ...
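If you go down that road, per-tag precision/recall/F1 can be computed directly from tag counts; a sketch over flat (already concatenated) tag lists:

```python
from collections import Counter

def per_tag_prf(gold, pred):
    """Per-tag (precision, recall, F1) from flat gold and predicted tag lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1    # p was predicted but wrong
            fn[g] += 1    # g was missed
    scores = {}
    for tag in set(tp) | set(fp) | set(fn):
        prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[tag] = (prec, rec, f1)
    return scores
```

In the cross-lingual setting, comparing these per-tag scores between the two test languages often reveals which tag types transfer poorly.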