
I have a very large collection of numerical and alphanumerical sets, and I would like to find common words/phrases across it with Python 2.7.

Example data. It is nothing close to my real data, but it does a good job of representing it.

'this is a test of the hosting',
'test is a test',
'we have more tests to run before we can trust it',
'if it true,  can trust it',
'tom is on time for ounce',
'what do you mean tom is out sick again'

I am looking for the following types of matches:

'is' x 5
'test' x 3
'is a test' x 2
'is a' x 2
'we' x 2
'trust it' x 2
'tom' x 2
..etc..

Is there a common library for this, or do I need to write one? I can do this with brute force, but on some of my larger files that could take years. I assume this is a common problem and some smart cookies have found a solution for it. I hope this isn't a traveling salesman problem.

JustBroken
  • Are you looking for unigram, bigram, trigram etc. counts? – Gingerbread Jul 17 '17 at 21:12
  • I have to admit, I have no idea what you mean by unigram, bigram, trigram... But a quick lookup has me thinking word-level bigram/trigram/etc. matching. Of the matching sets, I think a 4-word match would be the largest I would ever want to handle. – JustBroken Jul 17 '17 at 23:01

1 Answer


I think you are looking for unigram, bigram, and trigram counts, i.e. counts of one-, two-, and three-word sequences. You can use the NLTK library in Python to do what you want.
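To make the idea concrete, here is a minimal sketch that counts word-level n-grams of 1 to 4 words over the example data from the question. It deliberately uses only the standard library (`collections.Counter`) rather than NLTK, so it is a stand-in for the NLTK approach and not the NLTK API itself. The function name `count_ngrams`, the `max_n` parameter, and the plain whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def count_ngrams(lines, max_n=4):
    """Count every word-level n-gram of 1..max_n words across all lines.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, so punctuation stays attached to words.
    """
    counts = Counter()
    for line in lines:
        words = line.split()
        for n in range(1, max_n + 1):
            # Slide a window of n words over the line.
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


lines = [
    'this is a test of the hosting',
    'test is a test',
    'we have more tests to run before we can trust it',
    'if it true,  can trust it',
    'tom is on time for ounce',
    'what do you mean tom is out sick again',
]

counts = count_ngrams(lines)

# Keep only the n-grams that occur more than once.
repeated = {gram: c for gram, c in counts.items() if c > 1}
```

This makes a single pass over the data, so the work grows with the total number of words times `max_n` instead of comparing every phrase against every other phrase. Note that it matches whole words only: 'is a test' and 'trust it' each come out as 2 and 'test' as 3, but 'is' comes out as 4, because the 'is' inside 'this' is not a separate word.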

Also, check this link out.

Gingerbread