In [ ]:
import re
import numpy as np
import torch as th
import torch.autograd as ag
import torch.nn.functional as F
import torch.nn as nn

Deep Learning for NLP - lab exercise 1¶

In this first lab exercise, we will implement a simple bag-of-words classifier, i.e. a classifier that ignores the sequential structure of the sentence, and a classifier based on a convolutional neural network (CNN). The goal is to predict whether a sentence is a positive or negative review of a movie. We will use a dataset constructed from IMDB.

  1. Load and clean the data
  2. Preprocess the data for the NN
  3. Module definition
  4. Train the network!

We will implement these models with PyTorch, the most popular deep learning framework for Natural Language Processing. You can use the following links for help:

  • tutorials: http://pytorch.org/tutorials/
  • documentation: http://pytorch.org/docs/master/

Data¶

The data can be downloaded here: http://caio-corro.fr/dl4nlp/imdb.zip

There are two files: one with positive reviews (imdb.pos) and one with negative reviews (imdb.neg). Each file contains 300000 reviews, one per line.

The following functions can be used to load and clean the data.

In [ ]:
# Tokenize a sentence
def clean_str(string, tolower=True):
    """
    Tokenization/string cleaning.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    if tolower:
        string = string.lower()
    return string.strip()


# Reads the content of the file passed as an argument.
# If limit > 0, this function returns only the first "limit" non-empty sentences of the file.
def loadTexts(filename, limit=-1):
    dataset = []
    skip = 0
    with open(filename) as f:
        for line in f:
            cleanline = clean_str(line).split()
            if cleanline:
                dataset.append(cleanline)
            else:
                skip += 1
            if limit > 0 and len(dataset) >= limit:
                break

    print("Load", len(dataset), "lines from", filename, "/", skip, "lines discarded")
    return dataset
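
As a quick sanity check of the tokenizer (a hypothetical example; the expected output is shown as a comment):

In [ ]:
print(clean_str("It's a (great) movie!"))  # -> "it 's a ( great ) movie !"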

The following cell loads the first 5000 sentences of each review set.

In [ ]:
LIM = 5000
txtfile = ...  # path of the file containing positive reviews
postxt = loadTexts(txtfile,limit=LIM)

txtfile = ...  # path of the file containing negative reviews
negtxt = loadTexts(txtfile,limit=LIM)

Split the data into train / dev / test sets, for example by creating lists txt_train, label_train, txt_dev, ... You should take care to keep a 50/50 ratio between positive and negative instances in each set.

In [ ]:
# TODO
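
One possible way to build a balanced split (a minimal sketch; the 80/10/10 proportions and the function name split_balanced are assumptions, postxt and negtxt come from the cell above):

In [ ]:
# A minimal sketch of a balanced train/dev/test split (80/10/10 proportions are an assumption)
def split_balanced(pos, neg, dev_ratio=0.1, test_ratio=0.1):
    n = min(len(pos), len(neg))  # use the same number of instances from each class
    n_dev = int(n * dev_ratio)
    n_test = int(n * test_ratio)
    n_train = n - n_dev - n_test
    txt_train = pos[:n_train] + neg[:n_train]
    label_train = [1] * n_train + [0] * n_train
    txt_dev = pos[n_train:n_train + n_dev] + neg[n_train:n_train + n_dev]
    label_dev = [1] * n_dev + [0] * n_dev
    txt_test = pos[n_train + n_dev:n] + neg[n_train + n_dev:n]
    label_test = [1] * n_test + [0] * n_test
    return txt_train, label_train, txt_dev, label_dev, txt_test, label_test

txt_train, label_train, txt_dev, label_dev, txt_test, label_test = split_balanced(postxt, negtxt)

Ideally you should also shuffle the training set (e.g. with random.shuffle, shuffling texts and labels together) so that positive and negative instances are not grouped together during training.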

Converting data to PyTorch tensors¶

We will first convert the data to PyTorch tensors so it can be used in a neural network. To do that, you must first create a dictionary that maps words to integers. Add to the dictionary only words that appear in the training set (be sure to understand why we do that!).

Then, you can convert the data to tensors:

  • use tensors of longs: both the sentence and the label will be represented as integers, not floats!
  • these tensors do not require a gradient

A tensor representing a sentence is composed of the integer representation of each word, e.g. [10, 256, 3, 4]. Note that some words in the dev and test sets may not be in the dictionary (i.e. unknown words)! You can just skip them, even though this is a bad idea in general.

In [ ]:
# TODO
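
A minimal sketch of the conversion (the names word2idx and sentence_to_tensor are assumptions; labels are stored as tensors of longs, as required above):

In [ ]:
# Build the word -> integer dictionary from the training set only
word2idx = {}
for sentence in txt_train:
    for word in sentence:
        if word not in word2idx:
            word2idx[word] = len(word2idx)

# Convert a tokenized sentence to a tensor of longs, skipping unknown words
def sentence_to_tensor(sentence):
    indices = [word2idx[w] for w in sentence if w in word2idx]
    return th.tensor(indices, dtype=th.long)  # these tensors do not require a gradient

data_train = [sentence_to_tensor(s) for s in txt_train]
data_dev = [sentence_to_tensor(s) for s in txt_dev]
labels_train = th.tensor(label_train, dtype=th.long)
labels_dev = th.tensor(label_dev, dtype=th.long)

Note that a dev/test sentence made only of unknown words becomes an empty tensor; the training loop sketched at the end of this notebook simply skips such sentences.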

Neural network definition¶

You need to implement two networks:

  • a simple bag-of-words model (note: it may be better to take the mean of the input embeddings rather than the sum)
  • a simple CNN as described in the course

To simplify the code, you can first assume the input is always a single sentence, and then implement batched inputs. In the case of batched inputs, pass a (Python) list of tensors to the forward function.

The bag-of-words neural network should be defined as follows:

  • take as input a tensor that is a sequence of integers indexing word embeddings
  • retrieve the word embeddings from an embedding table
  • construct the "input" of the MLP by summing (or averaging) over all embeddings (i.e. the bag-of-words model)
  • build a hidden representation using an MLP (1 layer? 2 layers? experiment! but maybe first try without any hidden layer...)
  • project the hidden representation to the output space: it is a binary classification task, so the output space is a scalar where a negative (resp. positive) value means the review is negative (resp. positive).

The CNN is a little trickier to implement. The goal is to implement the one presented in the first lecture. Importantly, you should add "padding" tokens before and after the sentence so that the convolution is well defined even when there is a single word in the input. For example, if your input sentence is ["word"], you want to instead consider the sentence ["<BOS>", "word", "<EOS>"] if your window is of size 2 or 3. You can do this either directly when you load the data, or in the neural network module.

In [ ]:
# Bag-of-words classifier
class CBOW_classifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW_classifier, self).__init__()
        # TODO
        # To create an embedding table: https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
        
    def forward(self, inputs):
        # TODO
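
A minimal sketch of one possible implementation for the single-sentence case (the mean over embeddings and the absence of a hidden layer are choices, not requirements; the model returns a raw score, i.e. a logit):

In [ ]:
class CBOW_classifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW_classifier, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output_proj = nn.Linear(embedding_dim, 1)  # no hidden layer: the simplest variant

    def forward(self, inputs):
        # inputs: tensor of word indices, shape (sentence_length,)
        embedded = self.embeddings(inputs)         # (sentence_length, embedding_dim)
        bag = embedded.mean(dim=0)                 # mean over words: the bag-of-words vector
        return self.output_proj(bag).squeeze(-1)   # scalar logit

And a sketch of a simple CNN classifier (the window size, the number of filters and the use of max pooling are assumptions; the input is expected to already contain the <BOS>/<EOS> padding indices):

In [ ]:
class CNN_classifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters=50, window=3):
        super(CNN_classifier, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # 1D convolution over the sequence, with the embedding dimensions as input channels
        self.conv = nn.Conv1d(embedding_dim, n_filters, kernel_size=window)
        self.output_proj = nn.Linear(n_filters, 1)

    def forward(self, inputs):
        # inputs: tensor of word indices, shape (sentence_length,), including padding tokens
        embedded = self.embeddings(inputs)               # (sentence_length, embedding_dim)
        embedded = embedded.t().unsqueeze(0)             # (1, embedding_dim, sentence_length)
        features = F.relu(self.conv(embedded))           # (1, n_filters, n_windows)
        pooled = features.max(dim=2).values.squeeze(0)   # max pooling over positions -> (n_filters,)
        return self.output_proj(pooled).squeeze(-1)      # scalar logit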

Loss function¶

Create a loss function builder.

  • PyTorch loss functions are documented here: https://pytorch.org/docs/stable/nn.html#loss-functions
  • In our case, we are interested in BCELoss and BCEWithLogitsLoss. Read their documentation and choose the one that fits your network output
In [ ]:
 
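
If the network returns a raw score (a logit) as in the sketches above, BCEWithLogitsLoss is the natural choice: it applies the sigmoid internally and is numerically more stable than applying a sigmoid followed by BCELoss. A minimal sketch:

In [ ]:
# BCEWithLogitsLoss = sigmoid + binary cross-entropy, in a numerically stable formulation
loss_builder = nn.BCEWithLogitsLoss()

# Usage sketch: the output is a scalar logit, the gold label must be converted to a float tensor
# loss = loss_builder(output, gold.float())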

Training loop¶

Write your training loop!

  • parameterizable number of epochs
  • at each epoch, print the mean loss and the dev accuracy
In [ ]:
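
A minimal sketch of a single-sentence (non-batched) training loop, reusing the names defined in the sketches above (word2idx, data_train, labels_train, data_dev, labels_dev, loss_builder); the use of Adam, the learning rate and the embedding size are assumptions:

In [ ]:
def evaluate(model, data, labels):
    model.eval()
    correct = 0
    with th.no_grad():
        for sentence, gold in zip(data, labels):
            if len(sentence) == 0:  # skip sentences made only of unknown words
                continue
            pred = (model(sentence) > 0).long()  # positive logit -> positive review
            correct += int(pred == gold)
    return correct / len(labels)

def train(model, n_epochs=5, lr=1e-3):
    optimizer = th.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(n_epochs):
        model.train()
        total_loss = 0.0
        for sentence, gold in zip(data_train, labels_train):
            if len(sentence) == 0:
                continue
            optimizer.zero_grad()
            loss = loss_builder(model(sentence), gold.float())
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        dev_acc = evaluate(model, data_dev, labels_dev)
        print("epoch", epoch, "/ mean loss:", total_loss / len(data_train), "/ dev accuracy:", dev_acc)

model = CBOW_classifier(len(word2idx), embedding_dim=100)
train(model)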