
Per Gensim's documentation, changelog, and previous Stack Overflow answers, I know that passing training data in the LineSentence format via the corpus_file parameter can dramatically speed up Any2Vec training.

Documentation on the LineSentence format reads as follows:

Iterate over a file that contains sentences: one line = one sentence. Words must be already preprocessed and separated by whitespace.
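For concreteness, this is how I understand that format being consumed (the file name and hyperparameters below are just placeholders):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus.txt holds one whitespace-tokenized sentence per line, e.g.:
#   the party of the first part
#   shall indemnify the party of the second part

# Option A: wrap the file in a LineSentence iterable
model = Word2Vec(sentences=LineSentence("corpus.txt"), vector_size=100)

# Option B: pass the path directly, which enables the faster
# multi-threaded corpus_file training path
model = Word2Vec(corpus_file="corpus.txt", vector_size=100)
```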

My training data comprises tens of millions (and potentially more than 100 million) of sentences extracted from plaintext files using spaCy. A sample sentence quite often contains one or more line-break characters (\n).
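To illustrate (the sentence text is made up): writing such a sentence to a file verbatim splits it across two lines, which LineSentence would then read as two separate sentences:

```python
# One spaCy-extracted sentence containing an embedded line break:
sent = "The party of the first part\nshall indemnify the party of the second part."

with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write(sent + "\n")

# LineSentence now yields two token lists where there should be one:
#   ['The', 'party', 'of', 'the', 'first', 'part']
#   ['shall', 'indemnify', 'the', 'party', 'of', 'the', 'second', 'part.']
```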

How can I make these samples compatible with the LineSentence format? As far as I understand, these samples should be "understood" in the context of their line breaks, since those breaks are also present in the target text (data not trained upon). That means I can't simply strip them from the training data.

Do I escape the newline characters with \\n? Is there a way to pass a custom delimiter?

I appreciate any guidance. Thanks in advance.

1 Answer


LineSentence is only an appropriate iterable for classes like Word2Vec that expect a corpus to be a Python sequence in which each item is a list of string tokens.

The exact placement of linebreaks is unlikely to make much difference in usual word-vector training. If reading a file line-by-line, all words on the same line will still appear in each other's contexts. (Extra linebreaks will just prevent words from 'bleeding' over slightly into the contexts of preceding/subsequent texts – which in a large training corpus probably makes no net difference for end results.)

So mainly: don't worry about it.

If you think it might be a problem you could try either...

  1. Removing the newlines between texts that you conjecture "should" be a single text, creating longer lines. (Note, though, that you don't want any of your texts to exceed 10,000 tokens, or else an internal implementation limit in Gensim will mean tokens past the first 10,000 are ignored.)
  2. Replacing the newlines, in texts that you conjecture "should" be a single text, with some synthetic token, like say <nl> (or whatever).

...then evaluate whether the results have improved over simply not doing that. (I doubt they will improve, for basic Word2Vec/FastText training.)
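If you do try option 2, a rough sketch (the file names, and the assumption that blank lines separate distinct texts, are mine) might look like:

```python
from gensim.models.word2vec import LineSentence

# Sketch of option 2: collapse each multi-line text onto one line,
# marking the original line breaks with a synthetic <nl> token.
# Assumes blank lines separate distinct texts in raw_texts.txt.
with open("raw_texts.txt", encoding="utf-8") as src, \
        open("corpus.txt", "w", encoding="utf-8") as dst:
    for text in src.read().split("\n\n"):
        tokens = text.replace("\n", " <nl> ").split()
        if tokens:
            dst.write(" ".join(tokens) + "\n")

sentences = LineSentence("corpus.txt")  # one text per line; <nl> is just another token
```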

For Doc2Vec, you might have to pay more attention to ensuring that all words of a 'document' are handled as a single text. In that case, make sure that whatever iterable sequence produces your TaggedDocument-like objects assigns the same desired tag to all raw text that should be considered part of the same document.
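For example, a minimal sketch (the tags and texts are illustrative, not from any real corpus):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Two raw text fragments that belong to the same logical document
# share one tag, so Doc2Vec learns a single vector for that document.
fragments = [
    ("clause-0001", "the party of the first part"),
    ("clause-0001", "shall indemnify the party of the second part"),
    ("clause-0002", "this agreement is governed by the laws of delaware"),
]
corpus = [TaggedDocument(words=text.split(), tags=[tag]) for tag, text in fragments]

model = Doc2Vec(corpus, vector_size=100, epochs=20)
print(model.dv["clause-0001"])  # one learned vector per unique tag
```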

gojomo
  • Wow, thanks for your speedy response @gojomo! I appreciate your willingness to help. – Andrew Parsons Dec 11 '21 at 17:28
  • Oops, I didn't realize that hitting `Enter` would submit my reply. I should have clarified: I am trying to train `Doc2Vec` models, and ideally with relative speed. My target data comprises legal text; for example, my use cases involve searching for similar contract clauses. A document (in the NLP sense) is one to four sentences long (but sometimes shorter, e.g. section headings). Token length won't be an issue. I will experiment with removing `\n`, per your first suggestion. If I'm reading this correctly, you imply that line breaks have little bearing on the end result. – Andrew Parsons Dec 11 '21 at 17:44
  • In the `Doc2Vec` case, you should be sure text that should be considered part of the same 'document' is either (1) fed as one batch, with its appropriate `tag`; or (2) if fed as separate batches, repeat the `tag` indicating they're the same doc. The `gensim.models.doc2vec.TaggedLineDocument` class will read documents, 1 from each line, from a text file – giving each doc a single `int` tag representing its line-number. – gojomo Dec 11 '21 at 18:17
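A minimal sketch of the `TaggedLineDocument` approach mentioned in that last comment (the file name is a placeholder):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# docs.txt: one whitespace-tokenized document per line; each line's
# number (0, 1, 2, ...) becomes that document's single int tag.
corpus = TaggedLineDocument("docs.txt")
model = Doc2Vec(corpus, vector_size=100, epochs=20)
print(model.dv[0])  # vector for the document on the first line
```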