3

For instance, if I have data in a column like this

data
I love book
I love apple
I love book
I hate apple
I hate apple

How can I get result like this

I = 5
love = 3
hate = 2
book = 2
apple = 3

Can we achieve this with MySQL?

Konerak
  • 39,272
  • 12
  • 98
  • 118
user704006
  • 271
  • 1
  • 5
  • 14
  • 2
    If your data would be only "I book book", what would you want returned? `I = 1, book = 2` or `I = 1, book =1 `? – Nanne Apr 13 '11 at 13:07
  • 3
    This kind of job is not prerogative of rdbms. Use a programming language. – Nicola Cossu Apr 13 '11 at 13:08
  • @nick: true, but is it possible, and how would one do it? – Konerak Apr 13 '11 at 13:09
  • I believe it is possible by using subqueries. The count of records is easy: select count(*) from table – mlewis54 Apr 13 '11 at 13:12
  • 1
    Databases are used to store *structured* data so that data manipulation can be done fast and in logical manner. You can't organize unstructured data into logical segments just by sticking it in into random, meaningless columns. Databases aren't used to count occurrences of words in a sentence, they're here to store, sort and retrieve data. As for how would one do it in a programming language - there's plethora of examples if you google it. – Michael J.V. Apr 13 '11 at 13:13
  • possible duplicate of [Using SQL to determine word count stats of a text field](http://stackoverflow.com/questions/748276/using-sql-to-determine-word-count-stats-of-a-text-field) – Vladislav Rastrusny Apr 13 '11 at 14:20
  • possible duplicate of [Count the frequency of each word](http://stackoverflow.com/questions/22410458/count-the-frequency-of-each-word) – RandomSeed Jun 12 '14 at 12:20

5 Answers5

6

Here is a solution only using a query:

SELECT SUM(total_count) as total, value
FROM (

SELECT count(*) AS total_count, REPLACE(REPLACE(REPLACE(x.value,'?',''),'.',''),'!','') as value
FROM (
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.sentence, ' ', n.n), ' ', -1) value
  FROM table_name t CROSS JOIN 
(
   SELECT a.N + b.N * 10 + 1 n
     FROM 
    (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
   ,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
    ORDER BY n
) n
 WHERE n.n <= 1 + (LENGTH(t.sentence) - LENGTH(REPLACE(t.sentence, ' ', '')))
 ORDER BY value

) AS x
GROUP BY x.value

) AS y
GROUP BY value

Here is the full working fiddle: http://sqlfiddle.com/#!2/17481a/1

First we do a query to extract all words as explained here by @peterm(follow his instructions if you want to customize the total number of words processed). Then we convert that into a sub-query and then we COUNT and GROUP BY the value of each word, and then make another query on top of that to GROUP BY not grouped words cases where accompanied signs might be present. ie: hello = hello! with a REPLACE

Community
  • 1
  • 1
Gabriel Rodriguez
  • 1,163
  • 10
  • 23
3

If you want to perform such kind of text analysis, I would recommend using something like lucene, to get the termcount for each term in the document.

csupnig
  • 3,327
  • 1
  • 24
  • 22
  • You could write a complex sql function or stored procedure to do a bunch of work to insert each term into a temporary table, then query it, I'm sure. However, this is a lot of work, and will be extraordinarily slow. Search engines such as Lucene were made for this, and they are very appropriate for this case. Alternatively, process the statements into a second terms table each time a phrase is updated. – squawknull Apr 13 '11 at 13:16
1

create a user defined function like this and use it in your query

DELIMITER $$

CREATE FUNCTION `getCount`(myStr VARCHAR(1000), myword VARCHAR(100))
    RETURNS INT
    BEGIN
    DECLARE cnt INT DEFAULT 0;
    DECLARE result INT DEFAULT 1;

    WHILE (result > 0) DO
    SET result = INSTR(myStr, myword);
    IF(result > 0) THEN 
        SET cnt = cnt + 1;
        SET myStr = SUBSTRING(myStr, result + LENGTH(myword));
    END IF;
    END WHILE;
    RETURN cnt;    

    END$$

DELIMITER ;

Hope it helps Refer This

Community
  • 1
  • 1
Akhil
  • 2,602
  • 23
  • 36
1

This query is going to take a long time to run if your table is of any decent size. It may be better to keep track of the counts in a separate table and update that table as values are inserted or, if real time results are not necessary, to only run this query every so often to update the counts table and pull your data from it. That way, you're not spending minutes to get data from this complex query.

Here's what I've for you so far. It's a good start. The only thing you need to do is modify it to iterate through the words in each row. You could use a cursor or a subquery.

Create test table:

create table tbl(str varchar(100) );
insert into tbl values('data');
insert into tbl values('I love book');
insert into tbl values('I love apple');
insert into tbl values('I love book');
insert into tbl values('I hate apple');
insert into tbl values('I hate apple');

Pull data from test table:

SELECT DISTINCT str AS Word, COUNT(str) AS Frequency FROM tbl GROUP BY str;
FreeAsInBeer
  • 12,937
  • 5
  • 50
  • 82
-1

Split-string procedure is not my job. You can find it here

http://forge.mysql.com/tools/tool.php?id=4

I wrote you the rest of code.

drop table if exists mytable;
create table mytable (
id int not null auto_increment primary key,
mytext varchar(1000)
) engine = myisam;

insert into mytable (mytext)
values ('I love book,but book sucks!What do you,think   about it? me too'),('I love apple! it rulez.,No, it sucks a lot!!!'),('I love book'),('I hate apple!!! Me too.,!'),('I hate apple');

drop table if exists mywords;
create table mywords (
id int not null auto_increment primary key,
word varchar(50)
) engine = myisam;


delimiter //
drop procedure if exists split_string //
create procedure split_string (
    in input text
    , in `delimiter` varchar(10) 
) 
sql security invoker
begin
    declare cur_position int default 1 ;
    declare remainder text;
    declare cur_string varchar(1000);
    declare delimiter_length tinyint unsigned;

    drop temporary table if exists SplitValues;
    create temporary table SplitValues (
        value varchar(1000) not null 
    ) engine=myisam;

    set remainder = input;
    set delimiter_length = char_length(delimiter);

    while char_length(remainder) > 0 and cur_position > 0 do
        set cur_position = instr(remainder, `delimiter`);
        if cur_position = 0 then
            set cur_string = remainder;
        else
            set cur_string = left(remainder, cur_position - 1);
        end if;
        if trim(cur_string) != '' then
            insert into SplitValues values (cur_string);
        end if;
        set remainder = substring(remainder, cur_position + delimiter_length);
    end while;

end //
delimiter ;


delimiter // 
drop procedure if exists single_words//
create procedure single_words()
begin
declare finish int default 0;
declare str varchar(200);
declare cur_table cursor for  select replace(replace(replace(replace(mytext,'!',' '),',',' '),'.',' '),'?',' ') from mytable;
declare continue handler for not found set finish = 1;
truncate table mywords;
open cur_table;
my_loop:loop
fetch cur_table into str;
if finish = 1 then
leave my_loop;
end if;
call split_string(str,' ');
insert into mywords (word) select * from splitvalues;
end loop;
close cur_table;
end;//
delimiter ;

call single_words();

select word,count(*) as word_count 
from mywords
group by word;

+-------+------------+
| word  | word_count |
+-------+------------+
| a     |          1 |
| about |          1 |
| apple |          3 |
| book  |          3 |
| but   |          1 |
| do    |          1 |
| hate  |          2 |
| I     |          5 |
| it    |          3 |
| lot   |          1 |
| love  |          3 |
| me    |          2 |
| No    |          1 |
| rulez |          1 |
| sucks |          2 |
| think |          1 |
| too   |          2 |
| What  |          1 |
| you   |          1 |
+-------+------------+
19 rows in set (0.00 sec)

The code must be improved in order to consider any punctuation but this is the general idea.

Nicola Cossu
  • 54,599
  • 15
  • 92
  • 98