I was given an assignment to find the unique words in an arbitrary document (.txt) and their counts. I'm also given a dictionary, and a word in the document is valid only if it appears in the dictionary. The assignment must be solved using the MySQL command line with an attached database: no shell scripting, no programming, just MySQL commands.
I created a table for the dictionary and loaded the words from the dictionary.txt file into it with
LOAD DATA INFILE 'dictionary.txt' REPLACE INTO TABLE words (word);
and it worked just fine: one word per record. But the document file has no real delimiter. I could maybe read it in line by line, but how would I then compare the words in each line to the words in the dictionary using regular expressions? Doesn't MySQL return the entire field contents instead of just the matching words within a field?
Any help would be great.
EDIT: SQLFiddle seems to be down right now. Here's what I've done so far:
CREATE TABLE dictionary (word VARCHAR(50) NOT NULL UNIQUE, freq INT DEFAULT 0);
LOAD DATA INFILE 'dictionary.txt' REPLACE INTO TABLE dictionary (word);
SELECT * FROM dictionary;
yields:
+------------------------------+------+
| word                         | freq |
+------------------------------+------+
| a                            |    0 |
| aa                           |    0 |
| aaa                          |    0 |
| aaas                         |    0 |
| aah                          |    0 |
| aal                          |    0 |
| aalii                        |    0 |
| aam                          |    0 |
| aani                         |    0 |
| aardvark                     |    0 |
| aardvarks                    |    0 |
| aardwolf                     |    0 |
| aardwolves                   |    0 |
| aargh                        |    0 |
| aarhus                       |    0 |
| aaron                        |    0 |
.
.
.
.
| zymotize                     |    0 |
| zymotoxic                    |    0 |
| zymurgy                      |    0 |
| zyrenian                     |    0 |
| zyrian                       |    0 |
| zyryan                       |    0 |
| zythem                       |    0 |
| zythia                       |    0 |
| zythum                       |    0 |
| zyzomys                      |    0 |
| zyzzogeton                   |    0 |
+------------------------------+------+
300248 rows in set (0.29 sec)
The dictionary.txt file is just one word per line. (Maybe including a frequency count isn't necessary here.)
Then from there, I want to take an arbitrary text document, say http://www.usconstitution.net/const.txt, and read it into a MySQL table 'document' so that I can perform a search to the effect of 'find all the unique words in document.word where document.word is in dictionary.word; return each unique word and its count'.
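To make the goal concrete, here is a rough sketch of the query I have in mind. It assumes the document table has a single word column holding one word per row, which is exactly the part I can't get working yet. The CREATE TABLE definition and the occurrences alias are my own assumptions, mirroring the dictionary table above.

```sql
-- Assumed layout for the document table: one word per row.
-- No UNIQUE key here, so repeated words stay as separate rows and can be counted.
CREATE TABLE document (word VARCHAR(50) NOT NULL);

-- Keep only the document words that also appear in the dictionary,
-- then count how many times each of them occurs.
SELECT d.word, COUNT(*) AS occurrences
FROM document AS d
JOIN dictionary AS w ON w.word = d.word
GROUP BY d.word
ORDER BY occurrences DESC;
```

Because dictionary.word is UNIQUE, the join should not multiply rows, so COUNT(*) would be the number of times each valid word appears in the document.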
I tried to read in the file with
LOAD DATA INFILE 'const.txt' REPLACE INTO TABLE document LINES TERMINATED BY ' ' (word);
but I need the delimiter to be any whitespace, not just a space. For example, each line in const.txt is terminated by \n, so the last word of each line is placed into the table together with the first word of the next line, as word\nanotherword. I'm not worrying right now about words ending in punctuation (word. word, word; word-- word" etc.), so it's OK if they show up in the table with the punctuation attached.
example: SELECT * FROM DOCUMENT WHERE word REGEXP '\n';
yields:
| http://www.gutenberg.net
This |
| Gutenberg-tm,
including |
| Literary
Archive |
| to
subscribe |
| eBooks.
|
+---------------------------------------------+
3356 rows in set (0.00 sec)
Is there a way to use a regex delimiter for the LINES TERMINATED BY '' clause? Ideally, I'd like to use \s for any whitespace (\t, \n, ' ', etc.) so that each of the following ends up as an individual record:
http://www.gutenberg.net
This
Gutenberg-tm,
including
Literary
Archive
to
subscribe
eBooks.