I plan to train an ELMo or BERT model from scratch on data I have on hand, which consists of notes typed by many different people. The data has spelling errors, formatting problems, and inconsistent sentences. After reading the ELMo and BERT papers, I know that both models are trained on large numbers of sentences from sources such as Wikipedia. I haven't been able to find any processed training samples or any preprocessing tutorial for ELMo or BERT. My questions are:
- Do the BERT and ELMo models have standard data preprocessing steps or standard processed data formats?
- Given my existing dirty data, is there any way to preprocess it so that the resulting word representations are more accurate?