8

I plan to train an ELMo or BERT model from scratch on the data I have (notes typed by people). The notes were typed by many different people, so there are spelling mistakes, formatting problems, and inconsistencies in the sentences. After reading the ELMo and BERT papers, I know that both models are trained on large amounts of well-formed text, e.g. from Wikipedia. I haven't been able to find any processed training samples or any preprocessing tutorial for the ELMo or BERT models. My questions are:

  • Do the BERT and ELMo models have standard data preprocessing steps or standard processed data formats?
  • Given my existing dirty data, is there any way to preprocess it so that the resulting word representations are more accurate?
Jerry Yang
Xin

1 Answer

7

BERT uses WordPiece embeddings, which helps somewhat with dirty data: https://github.com/google/sentencepiece
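For illustration, a minimal SentencePiece sketch (the file names, vocabulary size and sample sentence are placeholders, not part of the BERT recipe):

```python
import sentencepiece as spm

# Train a subword model directly on the raw notes.
# "notes.txt" and the vocab size are placeholders -- adjust to your corpus.
spm.SentencePieceTrainer.Train(
    "--input=notes.txt --model_prefix=notes_sp --vocab_size=8000 --model_type=bpe"
)

# Load the trained model and split a noisy sentence into subword pieces.
sp = spm.SentencePieceProcessor()
sp.Load("notes_sp.model")
print(sp.EncodeAsPieces("ths is a typcial noisey note w/ typos"))
```

Even misspelled words get segmented into pieces that overlap with their correctly spelled forms, which is why subword vocabularies cope better with typos than whole-word vocabularies.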

Google Research also provides their data preprocessing code: https://github.com/google-research/bert/blob/master/tokenization.py
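If you go that route, the FullTokenizer from that file can be used directly. A small sketch, assuming the google-research/bert repo is on your PYTHONPATH and you have a WordPiece vocab file (vocab.txt is a placeholder path):

```python
# tokenization.py comes from https://github.com/google-research/bert
import tokenization

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

# Basic cleaning + WordPiece splitting; out-of-vocabulary words become "##" pieces.
tokens = tokenizer.tokenize("Ths is a typcial noisey note w/ typos")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
```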

The default ELMo implementation takes tokens as input (if you provide an untokenized string, it will split it on spaces). Thus spelling correction, deduplication, lemmatization (e.g. with spaCy, https://spacy.io/api/lemmatizer), separating tokens from punctuation, and other standard preprocessing steps may help.
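A rough cleaning sketch with spaCy along those lines (the model name and the exact filtering choices are just one possible setup, not a fixed recipe):

```python
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_tokens(text):
    """Tokenize, lowercase and lemmatize one note; drop punctuation and stray whitespace."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc if not tok.is_punct and not tok.is_space]

print(clean_tokens("The notes  were typed,quickly - with   odd spacing!"))
```

The resulting token lists can then be fed to ELMo as-is (or joined with single spaces if your ELMo wrapper expects strings).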

You may check standard ways to preprocess text in the NLTK package, https://www.nltk.org/api/nltk.tokenize.html (for example the Twitter tokenizer). Be aware that NLTK itself is slow. Many machine learning libraries also provide basic preprocessing (https://github.com/facebookresearch/pytext, https://keras.io/preprocessing/text/).
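For example, with NLTK's Twitter tokenizer (the flags shown are just one possible configuration for informal, typo-heavy text):

```python
from nltk.tokenize import TweetTokenizer

# Lowercase, shorten runs of repeated characters, drop @handles.
tknzr = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
print(tknzr.tokenize("Looooove this!!! see u tmrw @john :-)"))
```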

You may also experiment with providing BPE encodings or character n-grams as the input.
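A toy sketch of character n-grams, just to show why they are robust to typos (the helper below is hypothetical, not taken from any library):

```python
def char_ngrams(word, n=3):
    """Return character n-grams of a word with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("headache"))  # ['<he', 'hea', 'ead', 'ada', 'dac', 'ach', 'che', 'he>']
print(char_ngrams("headahce"))  # the typo still shares many n-grams with the correct form
```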

It also depends on the amount of data that you have; the more data you have, the smaller the benefit of preprocessing (in my opinion). Given that you want to train ELMo or BERT from scratch, you should have a lot of data.

Denis Gordeev
  • Can I ask why a larger data set benefits less from preprocessing? Is it because of computing cost? I was also wondering whether there is a typical length for each training sequence. I think I may need to cut my training examples, since each of them is about 2000 tokens long. – Xin Mar 01 '19 at 14:16
  • Sorry, bad wording. I meant quite the opposite: the more data you have, the smaller the negative impact of misspellings, because you have more examples of typos, orthographic errors and so on. Sequence length depends on the dataset; if 2000 is enough, then go with it. BERT is expensive, so they use a sequence length of 512. You may check their recommendations here: https://github.com/google-research/bert#pre-training-tips-and-caveats – Denis Gordeev Mar 04 '19 at 07:05