
This question is for those who are familiar with the OpenAI GPT or GPT-2 models, in particular with the encoding task (Byte-Pair Encoding). This is my problem:

I would like to know how I could create my own vocab.bpe file.

I have a Spanish text corpus that I would like to use to fit my own BPE encoder. I have succeeded in creating the encoder.json with the python-bpe library, but I have no idea how to obtain the vocab.bpe file. I have reviewed the code in gpt-2/src/encoder.py, but I have not been able to find any hint. Any help or ideas?

Thank you so much in advance.

rafaelmg07
  • Hi. I have a wild guess that [this](https://github.com/rkfg/gpt-2/blob/fromscratch/src/encoder_sp.py) is your repository. However, which library or approach finally worked to create the `vocab.bpe` file? And how did you create the `encoder.json` file in the first place? Thank you. – shamiul97 May 19 '20 at 09:31
  • I went down this custom encodings/vocab route myself, and what I took away from the experience was that it wasn't going to do what I thought it would, which is limit the output of GPT-2 to a word/character set. Was that what you were trying to do? – philipkd Jun 04 '20 at 23:20

2 Answers


Check out here; you can easily create the same vocab.bpe using the following command:

python learn_bpe -o ./vocab.bpe -i dataset.txt --symbols 50000
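If you would rather do this from Python than from the shell, here is a minimal sketch assuming the learn_bpe script above is subword-nmt's learn_bpe.py (the flags match its CLI); the file names and symbol count are placeholders:

```python
# Minimal sketch, assuming the learn_bpe script above comes from subword-nmt.
# File names and the symbol count are placeholders; adjust them to your corpus.
import codecs

from subword_nmt.learn_bpe import learn_bpe

with codecs.open("dataset.txt", encoding="utf-8") as infile, \
     codecs.open("vocab.bpe", "w", encoding="utf-8") as outfile:
    # num_symbols is the number of BPE merge operations to learn,
    # which roughly determines the subword vocabulary size.
    learn_bpe(infile, outfile, num_symbols=50000, min_frequency=2)
```

The output is a plain-text list of learned merge operations, one pair per line, which is the general format the GPT-2 codebase expects for vocab.bpe.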
vpcom

I haven't worked with GPT-2, but bpemb is a very good place to start for subword embeddings. According to the README:

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

I've used the pretrained embeddings for one of my projects along with sentencepiece and it turned out to be very useful.
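If it helps, loading the pretrained Spanish subwords is only a few lines. This is a rough sketch based on the bpemb README; the vocabulary size and dimension below are just example values:

```python
# Rough sketch of bpemb usage per its README; vs and dim are example values.
from bpemb import BPEmb

# Downloads (and caches) the Spanish BPE model and subword embeddings
# trained on Wikipedia.
bpemb_es = BPEmb(lang="es", vs=50000, dim=100)

print(bpemb_es.encode("Esto es una prueba"))       # BPE subword tokens
print(bpemb_es.embed("Esto es una prueba").shape)  # (n_subwords, 100) matrix
```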

scarecrow