Find phrases related to each other across multiple files

Question

I am having trouble working out how to approach this problem. There is a folder with 6000 text files. I need to find phrases that repeat across these files and include them in a report. This goes beyond a regular `grep -Hl <phrase> Folder/*.txt`, because I don't know the phrase in advance: the script is supposed to scan all the documents, take 5-word segments, and look through the rest of the documents for a match.

If this can be achieved using Python, I am all ears. I have thought about NLTK or machine learning, but I would need more details on how to apply them.

Can you add an example? Does `5 word segments` mean any five-word grouping in a document? — wwii, Nov 12 '16 at 01:31
Are you looking to first find a good candidate string to use to cluster the files? — gowrath, Nov 12 '16 at 01:43

score 0 · Answer 1 · edited May 23 '17 at 12:30

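Look into the n-gram methodology. An n-gram is a run of n consecutive words, so you can extract every "five-gram" segment from each file and look for the ones that turn up in more than one file.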

Here is a good example of how to use n-grams to find patterns in text. You would need to decide on a way to search through all of the text files. If they are small enough, you could combine them, or read them into a string, and parse from there.

Another way to use n-grams.
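To make the five-gram idea concrete, here is a minimal sketch using only the standard library. The function names (`ngrams`, `shared_phrases`, `read_folder`) are made up for illustration, and it assumes that lowercasing and dropping punctuation is an acceptable way to split words and that the phrase index for all the files fits in memory. Adjust both to your data.

```python
import re
from collections import defaultdict
from pathlib import Path


def ngrams(text, n=5):
    """Yield every run of n consecutive words in text, lowercased."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def shared_phrases(docs, n=5, min_files=2):
    """Map each n-word phrase to the sorted names of the documents containing it.

    docs is a {name: text} dict. Only phrases found in at least
    min_files different documents are kept.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for phrase in ngrams(text, n):
            index[phrase].add(name)
    return {
        phrase: sorted(names)
        for phrase, names in index.items()
        if len(names) >= min_files
    }


def read_folder(folder):
    """Read every .txt file in folder into a {filename: text} dict."""
    return {
        path.name: path.read_text(errors="ignore")
        for path in Path(folder).glob("*.txt")
    }


# Example usage (the folder name is taken from the question):
# report = shared_phrases(read_folder("Folder"))
# for phrase, names in sorted(report.items(), key=lambda kv: -len(kv[1])):
#     print(len(names), phrase, ", ".join(names), sep="\t")
```

Keying the index on the phrase and storing a set of file names means each file is read only once, instead of re-scanning the other 5999 files for every candidate phrase. Raising `min_files` trims the report to phrases that are widely shared.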

answered Nov 12 '16 at 01:47 by solvador · edited May 23 '17 at 12:30 by Community
