0

I have a MySQL table, where one of the fields contains a textual description (~5-200 words).

For example Reviews:

Rev_id    Place_id    Stars    Category    Text
1         12           3        Food       Nice food but a bad dirty place.
2         31           4        Sport      Not bad, they have everything.
3         55           1        Bar        Poor place,bad audience.

I'd like to make some word count analysis, such as general word frequency count (how many times each of the words has appeared) or top-K words per Category.

In the example:

word    count
bad     3
place   2
...

Is there a way to do it solely in MySQL without involving programming languages?

kjones
  • 1,339
  • 1
  • 13
  • 28
Dimgold
  • 2,748
  • 5
  • 26
  • 49
  • 2
    Possible duplicate of [Using SQL to determine word count stats of a text field](https://stackoverflow.com/questions/748276/using-sql-to-determine-word-count-stats-of-a-text-field) – pilsetnieks Jul 01 '17 at 10:56
  • It's not the same - I'm not interested in the number of words per text, but in the general number of times each word has appeared. – Dimgold Jul 01 '17 at 10:59
  • @MohaMad can you elaborate on that please? – Dimgold Jul 01 '17 at 11:44

1 Answers1

2

My logic for this question is: extract all words and count them!

So, create a table like your stored data:

CREATE TABLE `tbltest` (
  `Rev_id` int(11) NOT NULL AUTO_INCREMENT,
  `place_id` int(11) DEFAULT NULL,
  `Stars` int(11) DEFAULT NULL,
  `Category` varchar(45) DEFAULT NULL,
  `Text` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`Rev_id`),
  UNIQUE KEY `id_UNIQUE` (`Rev_id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8;

And creating a table for words:

CREATE TABLE `counting` (
  `word` varchar(45) NOT NULL,
  `counts` int(11) DEFAULT NULL,
  PRIMARY KEY (`word`),
  UNIQUE KEY `word_UNIQUE` (`word`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Now, create the MySQL Stored Procedure for splitting sentences and counting words:

drop procedure if exists sentence_words;
delimiter #
create procedure sentence_words(IN Cat VARCHAR(45))

begin

declare w_max int unsigned default 1;
declare w_counter int unsigned default 0;
declare done int unsigned default 0;

declare sentence varchar(255) default null;
declare cur cursor for select `text` from `tbltest` where `Category` = Cat;
declare continue handler for not found set done=1;
set done=0;
open cur;
    myloop: loop
        fetch cur into sentence;
        if done = 1 then leave myloop; end if;
        -- refine sentence!
        set sentence = replace(replace(replace(replace(
                sentence
        ,'.',' '),'!',' '),',',' '),';',' ');
        set sentence = replace(trim(sentence),'  ',' ');
        set w_max = length(sentence)-length(replace(sentence,' ',''))+1;
        start transaction;
        while w_counter < w_max do
            insert into `counting`(counts,word) values
                (1, substring_index( substring_index(
                    sentence,' ',w_counter+1) ,' ',-1)
                )
            ON DUPLICATE KEY UPDATE counts=counts+1;
            set w_counter=w_counter+1;
        end while;
        commit;
    end loop;
    close cur;
end #
delimiter ;

Finally, you can call the procedure and find words and counts in counting table. If you need each category word counts separated, remember to truncate or backup counting table before calling procedure for each Category.

truncate `counting`;
call sentence_words('Bar');
select * from `counting` order by counts desc; -- ? where length(word)>2
-- words | counts --
'audience', '1'
'bad', '1'
'place', '1'
'Poor', '1'
MohaMad
  • 2,575
  • 2
  • 14
  • 26