I would like to use PyTextRank
for keyphrase extraction. How can I feed feed 5 million documents (each document consisting of a few paragraphs) to the package?
This is the example I see on the official tutorial.
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n"
doc = nlp(text)
for phrase in doc._.phrases:
ic(phrase.rank, phrase.count, phrase.text)
ic(phrase.chunks)
Is my option only to concatenate several million documents to a single string and pass it to nlp(text)
? I do not think I could use nlp.pipe(texts)
as I want to create one network by computing words/phrases from all documents.