0

This is a strange one but I have found the Stackoverflow community to be very helpful. I have mySQL Table with a column full of parsed text data. I want to analyze the data and see in how many rows words appear.

ID     columnName
1      Car
2      Dog
3      CAR CAR car CAR

From the above example what I want returned is that the word CAR appears in two rows and the word Dog Appears in 1 row. I don't really care how much the word count is as much as in how many rows does the word appear in. The problem is that I don't know which words to search for. Is there a tool, or something I can build in python, that would show me the most popular words used and in how many rows do the words appear in. I have not idea where to start and it would be great if someone could assist me with this.

benipy
  • 111
  • 1
  • 4
  • 11
  • `group by columnname` is easy enough, but that's not going to help if you have `dog car` in a field... – Marc B Jul 02 '15 at 14:27
  • You're looking for a needle in a haystack. If the size of the haystack minus the needle is different from the starting size of the haystack, then the needle is present. – Strawberry Jul 02 '15 at 14:29
  • hahahah, Yeah I know. Definitely a needle in a haystack. Is it possible for me to first get a word count of all the words in a column and then in turn, manually eliminate the ones that I don't need. In this case for example it would come back with car = 5 and dog = 1. After that, can i search for car and see in how many rows it appears and the dog and see how many rows that appears in. This time around, i would expect the results to be car = 2 and dog = 1 – benipy Jul 02 '15 at 14:37
  • This seems similar to the first part of your problem : http://stackoverflow.com/questions/17942508/sql-split-values-to-multiple-rows – PaulF Jul 02 '15 at 14:39
  • definitely look into SUBSTRING. If you only need to use the first word for example in each column, then you could first grab a column of just the first words and group by these? – Dportology Jul 02 '15 at 14:45

1 Answers1

1

I'd use python:

1) setup python to work with mysql (loads of tutorials online)

2) define:

from collections import defaultdict
tokenDict = defaultdict(lambda: 0)

the former is a simple dictionary which returns 0 if there is no value with the given key (i.e. tokenDict['i_have_never_used_this_key_before'] will return 0)

3) read each row from the table, tokenize it and increment the token counts

tokens = row.split(' ') //tokenize
tokens = [lower(t) for t in tokens] //lowercase
tokens = set(tokens) //remove duplicates
for token in tokens:
    tokenDict[token] = tokenDict[token] + 1
Martin Boyanov
  • 416
  • 3
  • 13