Phrase/multi-word matching and counting across large data sets

Question

I have a very large collection of numeric and alphanumeric sets, and I would like to find the words/phrases that are common across it using Python 2.7.

Example data (nothing close to my real data, but it does a good job of representing it):

'this is a test of the hosting',
'test is a test',
'we have more tests to run before we can trust it',
'if it true,  can trust it',
'tom is on time for ounce',
'what do you mean tom is out sick again'

These are the kinds of matches I am looking for:

'is' x 5
'test' x 3
'is a test' x 2
'is a' x 2
'we' x 2
'trust it' x 2
'tom' x 2
..etc..

Is there a common library for this, or do I need to write one? I can do this with brute force, but on some of my larger files that could take years. I assume this is a common problem and that some smart cookies have already found a solution for it. I hope this isn't a traveling salesman problem.

I have to admit, I have no idea what you mean by unigram, bigram, trigram... But a quick lookup has me thinking of word-level bigram/trigram/etc. matching. I think a 4-word match would be the largest set I would ever want to handle. – JustBroken Jul 17 '17 at 23:01

score 0 · Accepted Answer · answered Jul 17 '17 at 23:55

I think you are looking for unigram, bigram, and trigram counts, that is, counts of one-, two-, and three-word sequences. You can use the NLTK library in Python to do what you want.

Also, check this link out.
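
To make the idea concrete, here is a minimal sketch of word-level n-gram counting on the sample data from the question. It uses only the standard library rather than NLTK, and `count_ngrams` and `max_n` are hypothetical names chosen for illustration. It assumes plain whitespace tokenization, so punctuation stays attached to words, and it caps phrases at 4 words as mentioned in the comments.

```python
from collections import Counter


def count_ngrams(lines, max_n=4):
    """Count every word-level n-gram of length 1..max_n across all lines.

    `lines` can be any iterable of strings (a list or an open file),
    so a large file can be streamed one line at a time.
    """
    counts = Counter()
    for line in lines:
        words = line.split()  # naive whitespace tokenization (an assumption)
        for n in range(1, max_n + 1):
            # Slide a window of n words over the line.
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


data = [
    'this is a test of the hosting',
    'test is a test',
    'we have more tests to run before we can trust it',
    'if it true,  can trust it',
    'tom is on time for ounce',
    'what do you mean tom is out sick again',
]

counts = count_ngrams(data)

# Keep only the n-grams seen more than once, most frequent first.
repeated = [(gram, c) for gram, c in counts.most_common() if c > 1]
for gram, c in repeated:
    print("%s x %d" % (gram, c))
```

This is a single pass over the data, so the running time grows linearly with the number of words (times `max_n`) instead of comparing every line against every other line. Note that with whole-word tokens 'is' is counted 4 times on this sample, because 'this' is a different token. The counter holds every distinct n-gram in memory, which may need rethinking for very large inputs. NLTK's `nltk.util.ngrams` can replace the inner sliding-window loop if you prefer to use the library.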

Gingerbread


As soon as I saw your unigram, bigram, trigram and did a search for 'python unigram bigram trigram' I found a lot about it. Thank you! – JustBroken Jul 18 '17 at 01:21
@JustBroken: Anytime :) Oftentimes, just a slight hint will get you what you want! – Gingerbread Jul 18 '17 at 17:29
