Per Gensim's documentation, changelog, and previous Stack Overflow answers, I know that passing training data in the LineSentence format to the corpus_file parameter can dramatically speed up Any2Vec training.
Documentation on the LineSentence format reads as follows:
Iterate over a file that contains sentences: one line = one sentence. Words must be already preprocessed and separated by whitespace.
My training data consists of tens of millions (and potentially over a hundred million) sentences extracted from plaintext files using spaCy. A sample sentence quite often contains one or more line break characters (\n).
How can I make these samples compatible with the LineSentence format? As far as I understand, these samples should be "understood" in the context of their line breaks, because the same breaks are present in the target text (the data the model is not trained on). That means I can't just remove them from the training data.
Do I escape the newline characters with \\n? Is there a way to pass a custom delimiter?
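To make the escaping idea concrete, here is a minimal sketch of what I have in mind. It assumes each line break can be swapped for a standalone placeholder token so that every sample fits on one physical line. The `<NL>` token and the helper names `to_line_sentence` and `write_corpus` are my own inventions for illustration, not anything Gensim provides.

```python
import re

# Hypothetical placeholder token standing in for a line break.
# This is an assumption of the sketch, not a Gensim feature.
NEWLINE_TOKEN = "<NL>"


def to_line_sentence(sentence: str) -> str:
    """Collapse one multi-line sample into a single whitespace-separated line."""
    # Surround each line break with spaces so the placeholder becomes its own token.
    replaced = re.sub(r"\r\n|\r|\n", f" {NEWLINE_TOKEN} ", sentence)
    # str.split() with no arguments splits on any run of whitespace.
    return " ".join(replaced.split())


def write_corpus(sentences, path):
    """Write one sample per line, skipping samples that end up empty."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            line = to_line_sentence(sentence)
            if line:
                f.write(line + "\n")
```

The resulting file would have exactly one sample per line, and I would apply the same replacement to the target text before querying the model. I am not sure whether this is a sensible approach or whether there is a built-in way to do it.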
I appreciate any guidance. Thanks in advance.