
Hi all, I have a MySQL table with a column of comma-separated values:

id    res
=============================
1     hh_2,hh_5,hh_6
------------------------------
2     hh_3,hh_5,hh_4
------------------------------
3     hh_6,hh_8,hh_7
------------------------------
4     hh_2,hh_7,hh_4
------------------------------

Please see the example above. I need to compare each row's 'res' value with every other row's 'res' value and display the count of matching items. Please help me get the count.

For example, 'hh_2' in the first row also exists in the fourth row, so its count is 2; likewise, every value needs to be compared across all rows.

I have written the function below and it works, but the table is very big: it has millions of records, so performance suffers. Checking one record against 50,000 records takes 25 seconds, so with 60 input rows it takes about an hour. Please help me optimize it.

    CREATE FUNCTION `combine_two_field`(s1 CHAR(96), s3 TEXT) RETURNS int(11)
    BEGIN
      DECLARE icount INT DEFAULT 0;
      DECLARE head1 CHAR(10);
      DECLARE head2 CHAR(10);
      DECLARE iSeparLen INT DEFAULT 1; -- length of the ',' separator

      -- Walk both lists in step, comparing element N of s1 with element N of s3.
      WHILE s1 != '' DO
            -- Take the next value off the front of each list.
            SET head1 = SUBSTRING_INDEX(s3, ',', 1);
            SET s3 = SUBSTRING(s3, LENGTH(head1) + 1 + iSeparLen);

            SET head2 = SUBSTRING_INDEX(s1, ',', 1);
            SET s1 = SUBSTRING(s1, LENGTH(head2) + 1 + iSeparLen);

            IF (head1 = head2) THEN
                SET icount = icount + 1;
            END IF;
      END WHILE;
      RETURN icount;
    END
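
For reference, the function would be called once per pair of rows, something like this (the table name sometable is an assumption, as the question does not name the table):

    -- Compare row 1's list against every other row's list.
    SELECT t2.id, combine_two_field(t1.res, t2.res) AS matches_count
    FROM sometable t1
    JOIN sometable t2 ON t2.id <> t1.id
    WHERE t1.id = 1;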

The table is also very big, and I want to reduce the fetch time as well.

UPDATE QUERY:

DROP PROCEDURE IF EXISTS `pcompare7` $$
CREATE DEFINER=`root`@`localhost` PROCEDURE `pcompare7`(IN in_analysis_id INT(11))
BEGIN

DROP TABLE IF EXISTS `tmp_in_results`;
CREATE TEMPORARY TABLE `tmp_in_results` (
  `t_id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
  `r_id` bigint(11) NOT NULL,
  `r_res` char(11) NOT NULL,
  PRIMARY KEY (`t_id`),
  KEY r_res (r_res)
)
ENGINE = InnoDB;

-- splite_snp() is called for its side effect: it splits each r_snp list
-- into one row per value in tmp_in_results.
SELECT splite_snp(r_snp, id, ruid)
FROM results
WHERE technical_status = 1
  AND critical_status = 1
  AND autosomal_status = 1
  AND gender_status != 'NO CALL'
  AND analys_id = in_analysis_id;

-- SELECT * FROM tmp_in_results;
-- Compare functionality
SELECT a.t_id, b.id, SUM(IF(FIND_IN_SET(a.r_res, b.r_snp), 1, 0))
FROM tmp_in_results a
CROSS JOIN results b
GROUP BY a.t_id, b.id;
END $$
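
The procedure would then be invoked once per analysis, for example (the argument is only a placeholder id):

    CALL pcompare7(1); -- placeholder analysis_id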

Function used to populate the temp table:

  DROP FUNCTION IF EXISTS `splite_snp` $$
    CREATE DEFINER=`root`@`localhost` FUNCTION `splite_snp`(s1 TEXT, in_id bigint(96), ruid char(11)) RETURNS tinyint(1)
    BEGIN
      DECLARE intpos1 TEXT;
      DECLARE Separ char(3) DEFAULT ',';
      DECLARE iSeparLen INT;

      SET iSeparLen = LENGTH(Separ);

      -- Peel one comma-separated value off the front of s1 per iteration
      -- and insert it as its own row in the temp table.
      WHILE s1 != '' DO

            SET intpos1 = SUBSTRING_INDEX(s1, ',', 1);
            SET s1 = SUBSTRING(s1, LENGTH(intpos1) + 1 + iSeparLen);

            INSERT INTO tmp_in_results(r_id, r_res) VALUES (in_id, intpos1);

      END WHILE;
      RETURN TRUE;
    END $$

New table structure:

pc_input

id    in_res     in_id
=============================
1     hh_2     1000
------------------------------
2     hh_3     1000
------------------------------
3     hh_6     1001
------------------------------
4     hh_2     1001
------------------------------

res_snp

id    r_res     r_id
=============================
1     hh_2     999
------------------------------
2     hh_3     999
------------------------------
3     hh_9     999
------------------------------
4     hh_2     998
------------------------------
5     hh_6     998
------------------------------
6     hh_9     998
------------------------------

Result:

in_id    r_id     matches_count
=============================
1000     999     2 (hh_2,hh_3)
------------------------------
1000     998     1 (hh_2)
------------------------------
1001     999     1 (hh_2)
------------------------------
1001     998     2 (hh_2,hh_6)
------------------------------

I have added separate indexes on both tables: in_res and in_id, and r_res and r_id.
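
For reference, a covering composite index on each table (as suggested later in the answer's comments) would look something like this; the index names are hypothetical:

-- One index covering both columns lets the join and the GROUP BY
-- be served by the same index.
ALTER TABLE pc_input ADD INDEX idx_in_snp_id (in_snp, in_id);
ALTER TABLE res_snp  ADD INDEX idx_r_snp_id  (r_snp, r_id);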

QUERY:

SELECT a.in_id, b.r_id, COUNT(*) FROM pc_input AS a INNER JOIN res_snp AS b ON (b.r_snp = a.in_snp) GROUP BY a.in_id, b.r_id;

But the MySQL server froze. Could you please suggest any other way, or how to optimize my query?

EXPLAIN TABLE: res_snp

Field   Type         Null   Key   Default   Extra
id      bigint(11)   NO     PRI   NULL      auto_increment
r_snp   varchar(50)  NO     MUL   NULL
r_id    bigint(11)   NO     MUL   NULL

EXPLAIN TABLE: pc_input

Field    Type         Null   Key   Default   Extra
id       bigint(11)   NO     PRI   NULL      auto_increment
in_snp   varchar(55)  NO     MUL   NULL
in_id    bigint(11)   NO     MUL   NULL

Explain Query:

id   select_type   table   type   possible_keys   key     key_len   ref                        rows   Extra
1    SIMPLE        a       ALL    in_snp          NULL    NULL      NULL                       192    Using temporary; Using filesort
1    SIMPLE        b       ref    r_snp           r_snp   52        rutgers22042014.a.in_snp   2861   Using where
  • Please update your question to show an example of the result set you are hoping for. – O. Jones May 19 '14 at 14:58
  • You probably are using a structured database incorrectly. I would advise you to take a look at how they are **meant** to be used in terms of different rows and tables before proceeding – secretformula May 19 '14 at 15:13
  • You should read this: http://stackoverflow.com/a/3653574/447489 – fancyPants May 19 '14 at 15:16
  • (to put more emphasis on the importance of proper design) read @fancyPants URL! – Ejaz May 19 '14 at 15:28
  • Thanks for your reply. I cannot normalize my table because every entry has 100 values, and my table probably has millions of records. Could you please suggest any other way? – sivanesan May 19 '14 at 15:47
  • 100 comma separated values per row could easily be split onto a different table (one row for each comma separated value). But I suspect the problem is existing systems that populate / read this field. Something based on my solution below will do the job (it is easy to expand it to cope with more than 100 comma separated values). It might be worth splitting these fields out into a properly indexed / normalised table for your use and leaving the comma separated version for other uses. You can always use a trigger on insert to take new values and insert them into your new table (see the trigger sketch after these comments). – Kickstart May 19 '14 at 15:53
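
A minimal sketch of that trigger idea, assuming the CSV table is called sometable (columns id, res) and the normalised table is res_split (columns r_id, r_res); both names are hypothetical:

DELIMITER $$
CREATE TRIGGER trg_sometable_split
AFTER INSERT ON sometable
FOR EACH ROW
BEGIN
  -- Walk the comma separated list, inserting one row per value.
  DECLARE remaining TEXT DEFAULT NEW.res;
  DECLARE token VARCHAR(25);
  WHILE remaining != '' DO
    SET token = SUBSTRING_INDEX(remaining, ',', 1);
    INSERT INTO res_split (r_id, r_res) VALUES (NEW.id, token);
    -- Drop the extracted value (and its trailing comma) from the front.
    SET remaining = IF(LOCATE(',', remaining) > 0,
                       SUBSTRING(remaining, LENGTH(token) + 2),
                       '');
  END WHILE;
END $$
DELIMITER ;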

1 Answer


This is possible, but nasty. A properly normalised database would be far easier, but sometimes you have to work with an existing database.

Something like this should do it (not tested). It uses a couple of subqueries to generate the numbers 0 to 9 which, combined, give a range from 0 to 99. That is used with SUBSTRING_INDEX to split the string up, along with DISTINCT to eliminate the duplicates this would otherwise generate (I assume there are no duplicates on any one line; if there are they can be removed, but it gets more complicated). The result is then used as a subquery to do the counts:

SELECT aRes, COUNT(*)
FROM
(
    SELECT DISTINCT sometable.id, SUBSTRING_INDEX(SUBSTRING_INDEX(sometable.res, ',', 1 + units.i + tens.i * 10), ',', -1) AS aRes
    FROM sometable
    CROSS JOIN (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) units
    CROSS JOIN (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) tens
) Sub1
GROUP BY aRes

EDIT - now tested:-

http://www.sqlfiddle.com/#!2/0ef59/4
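
To illustrate how the nested SUBSTRING_INDEX calls pick out the N-th element (values taken from the question's first sample row):

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('hh_2,hh_5,hh_6', ',', 2), ',', -1);
-- inner call keeps everything up to the 2nd comma -> 'hh_2,hh_5'
-- outer call keeps everything after the last comma -> 'hh_5'

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('hh_2,hh_5,hh_6', ',', 99), ',', -1);
-- an index past the end just repeats the last element -> 'hh_6',
-- which is why the DISTINCT is needed to remove those duplicates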

EDIT - Possible solution. Hopefully it will be acceptably quick.

First extract your input rows into a temp table:-

CREATE TEMPORARY TABLE tmp_record
(
    unique_id   INT NOT NULL AUTO_INCREMENT,
    id  INT,
    res varchar(25),
    PRIMARY KEY (unique_id),
    KEY `res` (`res`)
);

Load the above up with your test data

INSERT INTO tmp_record (unique_id, id, res)
VALUES
(1, 1,  'hh_2'),
(2, 1,  'hh_5'),
(3, 1,  'hh_6'),
(4, 2,  'hh_3'),
(5, 2,  'hh_5'),
(6, 2,  'hh_4');

Then you can do a join as follows.

SELECT a.id, b.id, SUM(IF(FIND_IN_SET(a.res, b.res), 1, 0))
FROM tmp_record a
CROSS JOIN sometable b
GROUP BY a.id, b.id

This joins every input row with every row of your main table and checks whether the individual input res is in the comma separated list. If it is, the IF returns 1, else 0. Those values are then summed up, grouped by the 2 ids.
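
For clarity, FIND_IN_SET returns the 1-based position of the value within the list (or 0 if it is absent), which the IF treats as true / false (values from the sample data):

SELECT FIND_IN_SET('hh_5', 'hh_2,hh_5,hh_6');  -- 2 (found at position 2, so true)
SELECT FIND_IN_SET('hh_9', 'hh_2,hh_5,hh_6');  -- 0 (not found, so false)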

Not tested, but hopefully this should work. I am unsure about performance (it might be slow as you are dealing with a LOT of potential records).

Note that temp tables only last for as long as the connection to the database exists. If you need to do this over several scripts then you will probably need to create a normal table (and remember to drop it when you have finished with it); a sketch follows.
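
A minimal sketch of that non-temporary variant, assuming a hypothetical table name work_record:

-- Same structure as tmp_record above, but persists across connections.
CREATE TABLE IF NOT EXISTS work_record
(
    unique_id   INT NOT NULL AUTO_INCREMENT,
    id  INT,
    res varchar(25),
    PRIMARY KEY (unique_id),
    KEY `res` (`res`)
);

-- ... populate it and run the comparison query, possibly from other scripts ...

DROP TABLE work_record;  -- remember to drop it once finished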

  • Thanks for your reply. My table is too large (50 million records). When I checked your query it took too much time. Could you please suggest any other way? – sivanesan May 19 '14 at 15:57
  • Is it 50 million rows, each with ~100 comma separated values? Do some have more? Do you have another table that lists all the possible values in the comma separated list? Problem is that someone has designed the data (with a comma separated field) with no regard to how databases work. – Kickstart May 19 '14 at 15:58
  • Thanks for your quick reply. Sorry for the inconvenience, it is around 10 lakh (1 million) records. Suppose I have 30 records as input (every record has 100 comma separated values); I need to compare these records with the 10 lakh records and get the count of matching values for every row, along with the id of the row. – sivanesan May 19 '14 at 16:33
  • About to go home so just a quick reply. It might be easiest to take the 30 records and split out the comma separated values and build a temp table (so 100 rows on the temp table per record). Then have a query that joins the temp table against your 10 lakh records using the FIND_IN_SET() mysql function. – Kickstart May 19 '14 at 16:57
  • Sorry for the delayed reply. How can I get the matched count? Note: my input is minimum 30 records and maximum 3000 records. – sivanesan May 20 '14 at 08:44
  • Looking at this now I am slightly confused. If you have 30 records (each with 100 comma separated values), how do you want to compare these to the 10 lakh records? Do you just want to compare the comma separated values, or the id and the comma separated values? What do you want to count? If one of the comma separated values appears in 2 rows of the 30 records, do you want to count that twice in the main table or once? – Kickstart May 20 '14 at 09:14
  • Sorry, I think my explanation was wrong. My input contains minimum 30 and maximum 3000 rows. Each row has 100 comma separated values. Each row needs to be compared against the 10 lakh records to get the matched count and the id of each row. For example, with inputs #1 and #2 (from the sample above): first compare #1 to #2, #3, #4 and get the counts: #1 and #2 -> matched count 1 (hh_5); #1 and #3 -> matched count 1 (hh_6); #1 and #4 -> matched count 1 (hh_2). Then compare #2 to #3, #4: #2 and #3 -> matched count 0; #2 and #4 -> matched count 1 (hh_4). – sivanesan May 20 '14 at 09:25
  • OK, added a suggestion but performance might still be a problem. – Kickstart May 20 '14 at 09:44
  • Ok thanks for your reply. I will check and update you – sivanesan May 20 '14 at 10:11
  • Thanks. When I checked, the MySQL server disconnected; it takes too much time. I have updated my query above under "UPDATE QUERY". Could you please check whether I am on the right track? – sivanesan May 20 '14 at 12:10
  • I think you are doing it in the way I suggested (I rarely write SQL stored procedures / functions though). It would be interesting to know which bits take the time (ie, the extract to the temp table or the compare). Possible you might be able to extract the values of res for a line more quickly (something similar to my original suggested code). If this is taking too long then the only thing I can suggest is to extract all the other records to another temp table. The extract will take a while but the join would be far quicker – Kickstart May 20 '14 at 12:28
  • Hi, thanks for your reply. As per your input I have created two tables, pc_input and res_snp. The pc_input table has 30*100 = 3000 records with ids; the res_snp table has 150000*100 = 15000000 records with ids. I have described the table structure above under "New table structure". I have now implemented partitioning for the res_snp table, and I need to get the matched snp count. Could you please help me? I have struggled with this for the last week; please suggest any other way. – sivanesan May 22 '14 at 09:57
  • Make sure you have an index that covers both res and in_id on each table. Then your table structure should allow you to do a fast join between them to get the counts – Kickstart May 22 '14 at 10:02
  • I have added the indexes on the fields above, and I have updated my query under "QUERY". Could you please check the query and assist me? – sivanesan May 22 '14 at 13:32
  • ON res_snp do you have an index covering both r_res AND r_id? Using the column names in your data the query would be `SELECT a.in_res, a.in_id, count(*) FROM pc_input AS a INNER JOIN res_snp AS b ON (b.r_res = a.in_res) GROUP BY a.in_res, a.in_id;` . SQL fiddle for this - http://www.sqlfiddle.com/#!2/1e6eb/5 – Kickstart May 22 '14 at 13:52
  • Thanks so much for your query. It is very useful and helped me a lot to get to the exact result. Based on your query I have updated the expected result above under "Result:". Thanks in advance for your help. – sivanesan May 22 '14 at 15:14
  • Ah, to get the result you have there try this `SELECT a.in_id, b.r_id, count(*), GROUP_CONCAT(b.r_res) FROM pc_input AS a INNER JOIN res_snp AS b ON (b.r_res = a.in_res) GROUP BY a.in_id, b.r_id;` – Kickstart May 22 '14 at 16:12
  • Hi, thanks so much. It is working, but my pc_input table has 9000 records and my res_snp table has 16000000. While running my query the MySQL server froze. Could you please suggest how to optimize the query, or any other way? It would help me a lot. Sorry to disturb you; thanks in advance for your help. – sivanesan May 23 '14 at 08:10
  • I don't think there is any further way to optimize it. Try doing an EXPLAIN of the query and post the results here. – Kickstart May 23 '14 at 08:16
  • Hi, thanks for your quick reply and your help. I have posted the EXPLAIN above. Please check it and let me know if I am wrong. Sorry to disturb you. Thanks. – sivanesan May 23 '14 at 08:43
  • It is an EXPLAIN of the query that is needed. Do you have a key covering both r_snp and r_id on your res_snp table? – Kickstart May 23 '14 at 08:49
  • @sivanesan The reason your computer is freezing is because of the problems highlighted in the comments to your question: the database is designed wrongly, and there is no way to get this down to reasonable times. That isn't to say that you can't eke out slightly more oomph from the current queries, but you're **never** going to get this to go fast. This is akin to asking why your volkswagen beetle won't lift off and go to the moon, and how to fix that problem. You can't. You need to redesign that database (car) to do that. If you can't, live with the problems. – Lasse V. Karlsen May 23 '14 at 08:54
  • @karlsen Thanks for the advice; could you please suggest how I should alter my database structure? – sivanesan May 23 '14 at 09:07
  • @kickstart I have added the EXPLAIN for the query. Please check it and let me know your thoughts. Thanks. – sivanesan May 23 '14 at 09:08
  • Does the index r_snp on table res_snp cover both the r_snp AND the r_id columns? If so (unlikely given the key length in the explain) that looks to be about the best that can be done. But it doesn't look like it should lock things up. Test it with more limited data (just 1~2 records on pc_input) and see how long it takes like that. – Kickstart May 23 '14 at 10:43