Analyze MySQL Text Data

Question

This is a strange one but I have found the Stackoverflow community to be very helpful. I have mySQL Table with a column full of parsed text data. I want to analyze the data and see in how many rows words appear.

ID     columnName
1      Car
2      Dog
3      CAR CAR car CAR

From the above example what I want returned is that the word CAR appears in two rows and the word Dog Appears in 1 row. I don't really care how much the word count is as much as in how many rows does the word appear in. The problem is that I don't know which words to search for. Is there a tool, or something I can build in python, that would show me the most popular words used and in how many rows do the words appear in. I have not idea where to start and it would be great if someone could assist me with this.

`group by columnname` is easy enough, but that's not going to help if you have `dog car` in a field... — Marc B, Jul 02 '15 at 14:27
You're looking for a needle in a haystack. If the size of the haystack minus the needle is different from the starting size of the haystack, then the needle is present. — Strawberry, Jul 02 '15 at 14:29
hahahah, Yeah I know. Definitely a needle in a haystack. Is it possible for me to first get a word count of all the words in a column and then in turn, manually eliminate the ones that I don't need. In this case for example it would come back with car = 5 and dog = 1. After that, can i search for car and see in how many rows it appears and the dog and see how many rows that appears in. This time around, i would expect the results to be car = 2 and dog = 1 — benipy, Jul 02 '15 at 14:37
This seems similar to the first part of your problem : http://stackoverflow.com/questions/17942508/sql-split-values-to-multiple-rows — PaulF, Jul 02 '15 at 14:39
definitely look into SUBSTRING. If you only need to use the first word for example in each column, then you could first grab a column of just the first words and group by these? — Dportology, Jul 02 '15 at 14:45

score 1 · Accepted Answer · answered Jul 02 '15 at 14:53

I'd use python:

1) setup python to work with mysql (loads of tutorials online)

2) define:

from collections import defaultdict
tokenDict = defaultdict(lambda: 0)

the former is a simple dictionary which returns 0 if there is no value with the given key (i.e. tokenDict['i_have_never_used_this_key_before'] will return 0)

3) read each row from the table, tokenize it and increment the token counts

tokens = row.split(' ') //tokenize
tokens = [lower(t) for t in tokens] //lowercase
tokens = set(tokens) //remove duplicates
for token in tokens:
    tokenDict[token] = tokenDict[token] + 1

Analyze MySQL Text Data

1 Answers1