FastText vectors with spaCy: A Tutorial

fastText has recently become a popular choice of word embedding among developers and researchers, alongside GloVe, word2vec, StarSpace, RAND-WALK, and others. By default, spaCy expects you to provide word2vec vectors. However, you can use your own fastText vectors instead!

Train the Dragon

The first step, obviously, is to train your fastText vectors. After training, you should be left with two files: a .vec and a .bin. We will need the .vec file for this exercise.
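It helps to know what the .vec file contains before handing it to spaCy. It is plain text: a header line with the number of words and the vector width, followed by one line per word with the word and its values separated by spaces. Here is a minimal sketch for inspecting that layout, assuming the standard format. The helper name read_vec and the sample values are made up for illustration and are not part of fastText or spaCy.

```python
import io


def read_vec(handle):
    """Parse fastText's text vector format from an open text handle."""
    # Header line: "<number of words> <vector width>"
    header = handle.readline().split()
    n_rows, n_dims = int(header[0]), int(header[1])

    vectors = {}
    for line in handle:
        line = line.rstrip()
        if not line:
            continue
        # Split from the right so the last n_dims fields are the values
        # and everything before them is the word.
        pieces = line.rsplit(" ", n_dims)
        vectors[pieces[0]] = [float(v) for v in pieces[1:]]
    return n_rows, n_dims, vectors


# Tiny made-up sample; real values come from your own trained model.
sample = io.StringIO("2 3\nবাংলা 0.1 0.2 0.3\nবাংলাদেশ 0.2 0.1 0.4\n")
n_rows, n_dims, vectors = read_vec(sample)
print(n_rows, n_dims, sorted(vectors))
```

For a real file, you would pass something like open('model.vec', encoding='utf-8') instead of the in-memory sample, where model.vec stands in for whatever your output file is called.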

Transformation

The spaCy source code includes an example script for using fastText vectors with spaCy. The script documents its own usage and argument list.

Saving for the future

Now modify the script a little: at the end of main, save your nlp object to disk.

nlp.to_disk('dir_name')

And voila. You're done. Now you can use your model and test out incredible stuff.

Testing

Now test that the model works. Choose words that exist in your own vectors, of course; mine are from Bengali.

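import spacy

# Load the model saved earlier with nlp.to_disk('dir_name')
nlp = spacy.load('dir_name')
doc = nlp('বাংলা বাংলাদেশ')

# Compare the vectors of the first and second tokens
print(doc[0].similarity(doc[1]))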

If everything worked, you should get a non-zero similarity score.
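To make sense of that number: spaCy's similarity defaults to the cosine of the angle between the two vectors. The sketch below reproduces the calculation in plain Python on made-up vectors, so you can see why a zero usually points to a missing vector. The function name cosine_similarity and the values are illustrative only, and returning 0.0 for a zero-length vector is an assumption chosen to mirror what you would typically see from spaCy in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A word without a vector is all zeros; report 0.0 instead of dividing by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, so the score is close to 1
w = [0.0, 0.0, 0.0]   # stands in for a word that has no vector

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

If your test prints exactly 0, check that both words actually appear in your .vec file.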

Credits

support.prodi.gy