How to parse an arbitrary document into mysql

Question

I was given an assignment to find the unique words in an arbitrary document (.txt) and their counts. I'm also given a dictionary, and a word in the document is valid only if it's in the dictionary. The assignment has to be solved using the MySQL command line with an attached database: no shell scripting, no programming, just MySQL commands.

I created a table for the dictionary and loaded the words from the dictionary.txt file into it with LOAD DATA INFILE 'dictionary.txt' REPLACE INTO TABLE words (word); and it worked just fine: one word per record. But the document file has no real delimiter. I could maybe read it in line by line, but how would I compare the words in each line to the words in the dictionary using regular expressions? Doesn't MySQL return the entire field contents instead of just the matching words within a field?

Any help would be great.

EDIT: SQLFiddle seems to be down right now. Here's what I've done so far:

CREATE TABLE dictionary (word VARCHAR(50) NOT NULL UNIQUE, freq INT DEFAULT 0);

LOAD DATA INFILE 'dictionary.txt' REPLACE INTO TABLE dictionary (word);

SELECT * FROM dictionary; yields:

+------------------------------+------+
| word                         | freq |
+------------------------------+------+
| a                            |    0 | 
| aa                           |    0 | 
| aaa                          |    0 | 
| aaas                         |    0 | 
| aah                          |    0 | 
| aal                          |    0 | 
| aalii                        |    0 | 
| aam                          |    0 | 
| aani                         |    0 | 
| aardvark                     |    0 | 
| aardvarks                    |    0 | 
| aardwolf                     |    0 | 
| aardwolves                   |    0 | 
| aargh                        |    0 | 
| aarhus                       |    0 | 
| aaron                        |    0 | 
.
.
.
.
| zymotize                     |    0 | 
| zymotoxic                    |    0 | 
| zymurgy                      |    0 | 
| zyrenian                     |    0 | 
| zyrian                       |    0 | 
| zyryan                       |    0 | 
| zythem                       |    0 | 
| zythia                       |    0 | 
| zythum                       |    0 | 
| zyzomys                      |    0 | 
| zyzzogeton                   |    0 | 
+------------------------------+------+
300248 rows in set (0.29 sec)

The dictionary.txt file is just one word per line. (maybe including a frequency count isn't necessary here).

Then from there, I want to take an arbitrary text document, say http://www.usconstitution.net/const.txt, and read it in to a mysql table 'document' such that I can perform a search to the effect of 'find all the unique words in document.word where document.word is in dictionary.word. return each unique word and its count'.
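
If document held exactly one word per row, that search could be sketched roughly as below. The demo_dictionary and demo_document tables and their rows are made up for illustration; they only stand in for the real tables.

```sql
-- Hypothetical miniature tables, for illustration only.
CREATE TEMPORARY TABLE demo_dictionary (word VARCHAR(50) NOT NULL UNIQUE);
CREATE TEMPORARY TABLE demo_document (word VARCHAR(50) NOT NULL);

INSERT INTO demo_dictionary (word) VALUES ('the'), ('people'), ('we');
INSERT INTO demo_document (word) VALUES ('we'), ('the'), ('people'), ('of'), ('the');

-- Keep only document words that appear in the dictionary, then count each one.
SELECT doc.word, COUNT(*) AS freq
FROM demo_document AS doc
JOIN demo_dictionary AS dict ON dict.word = doc.word
GROUP BY doc.word
ORDER BY doc.word;
```

With these sample rows, 'of' is dropped because it is not in the miniature dictionary, and 'the' is counted twice.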

I tried to read in the file with LOAD DATA INFILE 'const.txt' REPLACE INTO TABLE document LINES TERMINATED BY ' ' (word); but I need the delimiter to be any whitespace, not just a space. For example, each line in const.txt is terminated by \n, so the last word of one line and the first word of the next end up in the table as a single record, word\nanotherword. I'm not worrying right now about words ending in punctuation (word. word, word; word-- word" etc.), so it's OK if they show up in the table with the punctuation attached.

example: SELECT * FROM DOCUMENT WHERE word REGEXP '\n'; yields:

| http://www.gutenberg.net

This              | 
| Gutenberg-tm,
including                     | 
| Literary
Archive                            | 
| to
subscribe                                | 
| eBooks.
                                    | 
+---------------------------------------------+
3356 rows in set (0.00 sec)

Is there a way to use a regex delimiter for the LINES TERMINATED BY '' clause? Ideally, I'd like to use \s for any whitespace (\t, \n, a space, etc.) so that http://www.gutenberg.net, This, Gutenberg-tm,, including, Literary, Archive, to, subscribe, eBooks. are all individual records.

How does your file look like ? Note that the regex flavor of mysql isn't powerful like PCRE or other flavors, so I can't really see a valid use-case ... — HamZa, Oct 16 '13 at 07:14
It looks interesting, but can you provide sample data? (in both files) with desired result? — Alma Do, Oct 16 '13 at 07:14
Think there's plenty here who'd enjoy looking at this if you could post your table + data in [SQLFiddle](http://sqlfiddle.com/) — Steve Chambers, Oct 16 '13 at 10:29

score 1 · Accepted Answer · edited May 23 '17 at 10:25

Create table for document using TEXT to store each whole line:

CREATE TABLE document (line TEXT);

Load document into table with one row per line:

LOAD DATA INFILE 'const.txt' INTO TABLE document LINES TERMINATED BY '\n' (line);

Then use the following query:

SELECT word,
       SUM(
           -- Per line: (padded length - length with matches removed) / word length
           ROUND((LENGTH(CONCAT(' ', line, ' ')) - 
                  LENGTH(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
                         REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
                         -- Pad the lowercased line, then turn separators into spaces
                         CONCAT(' ', LOWER(line), ' '),
                         '.', ' '),
                         '?', ' '),
                         '!', ' '),
                         ',', ' '),
                         ';', ' '),
                         ':', ' '),
                         '"', ' '),
                         '\n', ' '),
                         '\r', ' '),
                         '\t', ' '),
                         -- Swap each space-padded dictionary word for two spaces
                         CONCAT(' ', word, ' '), '  ')))
                 / LENGTH(word))
          ) AS freq
FROM dictionary
CROSS JOIN document
GROUP BY word;

Explanation: The query is based on this excellent answer by @Yannis for counting occurrences of text within text. Before counting, it adds a space to both the start and end of the line and converts punctuation (.?!,;:"), tabs, newlines and carriage returns into spaces. (You may wish to add to this list, e.g. brackets, slashes etc., depending on what should be allowed as word separators.) Each dictionary word, with a space added before and after, is then replaced by two spaces. Every match therefore shortens the line by exactly LENGTH(word) characters, so dividing the difference in length by LENGTH(word) gives the number of occurrences in that line. A CROSS JOIN is used to line up each word in the dictionary against each line in the document, and a GROUP BY is used to SUM the counts of each word. One caveat: when the same word appears twice in a row, the two occurrences share a separating space, so only the first of them is matched.
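
To see the arithmetic on a single line, here is a small sketch that runs without any tables. The user variables @line, @word and @clean and the sample sentence are made up for illustration, and only two punctuation marks are replaced to keep it short.

```sql
-- Hypothetical sample line and dictionary word.
SET @line = 'We the People, and the States.';
SET @word = 'the';

-- Pad with spaces, lowercase, and turn the punctuation into spaces.
SET @clean = REPLACE(REPLACE(CONCAT(' ', LOWER(@line), ' '), ',', ' '), '.', ' ');

-- Each match of the space-padded word shortens the string by LENGTH(@word).
SELECT @clean AS cleaned,
       LENGTH(@clean) AS padded_length,
       LENGTH(REPLACE(@clean, CONCAT(' ', @word, ' '), '  ')) AS length_after_replace,
       ROUND((LENGTH(@clean)
              - LENGTH(REPLACE(@clean, CONCAT(' ', @word, ' '), '  ')))
             / LENGTH(@word)) AS occurrences;
```

The padded line is 32 characters long and drops to 26 once both occurrences of the word are swapped out, so occurrences comes out as 2.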

Sample results: I checked a very small dictionary against the const.txt mentioned in the question, and it gave the following:

word        freq
================
a           97
aardvark    0
and         264
appointment 2
at          23
zyzzogeton  0

This is great for a small dictionary, thank you. I just can't seem to get it to run efficiently for a large dictionary (300,000+ words). Maybe I'll keep thinking about how to tweak it. — ejsuncy, Oct 17 '13 at 00:37
One thing you could try is to load the whole document into a single row in the table rather than one row per line and then include return characters in the list that are replaced by spaces. Would expect this to be faster. Can't try it right now as not on my laptop but please let me know if you have any problems... — Steve Chambers, Oct 17 '13 at 06:49
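
A minimal sketch of the single-row idea from the comment above, assuming the server is allowed to read the file (FILE privilege, a location permitted by secure_file_priv, and a size below max_allowed_packet). The table name document_whole, the column name body and the file path are all hypothetical.

```sql
-- Hypothetical table holding the entire document in one row.
-- MEDIUMTEXT is used because a TEXT column is limited to 65,535 bytes.
CREATE TABLE document_whole (body MEDIUMTEXT);

-- The path is an assumption; LOAD_FILE() needs an absolute path
-- that the MySQL server itself can read.
INSERT INTO document_whole (body)
VALUES (LOAD_FILE('/var/lib/mysql-files/const.txt'));

-- Sanity check: NULL here means LOAD_FILE() could not read the file.
SELECT LENGTH(body) AS bytes_loaded FROM document_whole;
```

The counting query would then CROSS JOIN dictionary with document_whole and use body in place of line; the '\n' and '\r' replacements it already contains take care of the line breaks. Whether this is actually faster for a 300,000-word dictionary has not been measured here.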
