
I am having trouble approaching this problem: there is a folder with 6000 text files, and I need to find phrases that repeat across these files and include them in a report. This goes beyond a regular grep -Hl <phrase> Folder/*.txt because I don't know the phrase to capture in advance. The program is supposed to scan all the documents, take 5-word segments, and look through the rest of the documents for a match.

If there is a way to achieve this using Python, I am all ears. I have thought about NLTK or machine learning, but I would need more details about either.

1 Answer


Look into the n-gram methodology. You can parse for "five-gram" segments within the files.
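To make the idea concrete, here is a minimal sketch of pulling five-word segments out of a piece of text using only the standard library. The function name word_ngrams and the tokenizing regex are my own choices for illustration (lowercasing and dropping punctuation is an assumption about what should count as "the same phrase"), so adjust them to your data.

```python
import re

def word_ngrams(text, n=5):
    """Return every run of n consecutive words in text, as tuples."""
    # Assumption: case and punctuation should not matter when comparing phrases.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of n words across the text, one word at a time.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "the quick brown fox jumps over the lazy dog"
for gram in word_ngrams(sample):
    print(" ".join(gram))
```

A text shorter than n words simply yields an empty list, so short files need no special handling.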

Here is a good example of how to use n-grams to find patterns in text. You would need to decide on a way to search through all of the text files. If they are small enough, you could combine them, or read them into a string, and parse from there.
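As a rough sketch of the "search through all of the text files" step: rather than combining the files, one option is to read them one at a time and record which files each five-word phrase appears in. The names shared_phrases and min_files are hypothetical, and I am assuming the files are UTF-8 text with a .txt extension in a single folder. With 6000 files the dictionary could get large, so treat this as a starting point, not a tuned solution.

```python
import re
from collections import defaultdict
from pathlib import Path

def word_ngrams(text, n=5):
    """Return the set of distinct n-word phrases in text (lowercased)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_phrases(folder, n=5, min_files=2):
    """Map each n-word phrase to the sorted names of the files containing it.

    Only phrases found in at least min_files different files are kept.
    """
    files_by_phrase = defaultdict(set)
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="ignore" is an assumption: skip undecodable bytes instead of failing.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for phrase in word_ngrams(text, n):
            files_by_phrase[phrase].add(path.name)
    return {
        phrase: sorted(names)
        for phrase, names in files_by_phrase.items()
        if len(names) >= min_files
    }
```

Calling shared_phrases("Folder") returns a dictionary you can sort by the number of files per phrase and write out as the report. Raising min_files narrows the result to phrases that recur more widely.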

Another way to use n-grams.

solvador