I have a very large collection of numerical and alphanumerical sets, and I would like to find the common words/phrases across it with Python 2.7.
Example data (nothing close to my real data, but it does a good job of representing it):
'this is a test of the hosting',
'test is a test',
'we have more tests to run before we can trust it',
'if it true, can trust it',
'tom is on time for ounce',
'what do you mean tom is out sick again'
These are the kinds of matches I am looking for:
'is' x 5
'test' x 3
'is a test' x 2
'is a' x 2
'we' x 2
'trust it' x 2
'tom' x 2
..etc..
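The matching above can be sketched as counting word n-grams (runs of consecutive words). This is only a minimal illustration, under the assumptions that words are separated by whitespace and that phrases of up to three words are of interest; `count_ngrams` and `max_n` are hypothetical names, not part of any library. It uses only `collections.Counter`, which is available in Python 2.7.

```python
from collections import Counter


def count_ngrams(lines, max_n=3):
    """Count every run of 1..max_n consecutive words across all lines.

    Hypothetical helper: assumes whitespace-separated words.
    """
    counts = Counter()
    for line in lines:
        words = line.split()
        for n in range(1, max_n + 1):
            # Slide a window of n words over the line.
            for i in range(len(words) - n + 1):
                counts[' '.join(words[i:i + n])] += 1
    return counts


lines = [
    'this is a test of the hosting',
    'test is a test',
    'we have more tests to run before we can trust it',
    'if it true, can trust it',
    'tom is on time for ounce',
    'what do you mean tom is out sick again',
]

counts = count_ngrams(lines)

# Keep only words/phrases that occur more than once, most frequent first.
repeated = [(phrase, c) for phrase, c in counts.most_common() if c > 1]
for phrase, c in repeated:
    print('%s x %d' % (phrase, c))
```

Because this counts whole words rather than substrings, 'this' does not count toward 'is' and 'tests' does not count toward 'test', so some totals may differ from a substring-based count. Each line is processed once, so the work grows roughly with the total number of words times `max_n`, though the counter itself can get large on big inputs.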
Is there a common library for this, or do I need to write one? I can do this with brute force, but on some of my larger files that could take years. I assume this is a common problem and that some smart cookies have already found a solution for it. I hope this isn't a traveling-salesman kind of problem.