
I have a table with col1 (an int id), col2 (a varchar holding comma-separated values) and a third column for assigning a group to each row. The table looks like this:

  col1          col2     group
..............................
       1        2,3,4       
       2        5,6        
       3        1,2,5
       4        7,8
       5        11,3
       6        22,8

This is only a sample of the real data. Now I have to assign a group number to each row so that the output looks like this:

  col1          col2       group
..............................
       1        2,3,4       1
       2        5,6         1
       3        1,2,5       1
       4        7,8         2
       5        11,3        1
       6        22,8        2

The logic for assigning group numbers is that rows whose comma-separated strings in col2 share a value must get the same group number. For example, wherever '2' appears in col2 the row gets the same group number; the complication is that 2,3,4 appear together, so any of those three values found anywhere in col2 means the row belongs to the same group. The major part is that '2,3,4' and '1,2,5' both contain 2, so all of the values 1,2,3,4,5 have to be assigned the same group number. I tried a stored procedure with MATCH ... AGAINST on col2, but I am not getting the desired result.
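
For illustration, the rows whose col2 contains a single value such as 2 can be found with MySQL's FIND_IN_SET, which matches whole list elements; this only covers matching one value, not the chained grouping:

SELECT col1, col2
FROM company
WHERE FIND_IN_SET('2', col2);  -- matches 2 as a whole element, so 22 or 12 is not matched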

Most important: I can't use normalization, because I can't afford to build a new table from my original table, which has millions of records. Normalization isn't even helpful in my context.


Achieved so far: I have made the group column auto-increment and then wrote this procedure:

BEGIN
  declare col1_new, group_new int;
  declare col2_new varchar(100);
  declare done tinyint default 0;
  declare cur1 cursor for select col1, col2, `group` from company;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  open cur1;
  REPEAT
    fetch cur1 into col1_new, col2_new, group_new;
    -- give every row whose col2 matches this row's col2 the same group number
    update company set `group` = group_new
    where match(col2) against(concat("'", col2_new, "'"));
  until done end repeat;
  close cur1;
  select * from company;
END

This procedure runs (no syntax errors), but the problem is that I am not achieving exactly the desired result.
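
A rough direction (just a sketch, untested on real data): if the group column is first initialised to a distinct value per row (e.g. copied from col1), the group numbers could be propagated along the chain by repeating a self-join UPDATE until nothing changes. The hypothetical propagate_groups procedure below uses a REGEXP built with REPLACE to turn one row's list into an alternation that matches any shared value:

drop procedure if exists propagate_groups;
delimiter //
create procedure propagate_groups()
begin
  -- assumption: `group` already holds a distinct number per row, e.g. via
  --   update company set `group` = col1;
  repeat
    -- pull each row's group down to the smallest group of any row that
    -- shares at least one comma-separated value with it
    update company a
      join company b
        on concat(',', a.col2, ',') regexp
           concat(',(', replace(b.col2, ',', '|'), '),')
       and b.`group` < a.`group`
      set a.`group` = b.`group`;
  until row_count() = 0 end repeat;
end//
delimiter ;

After it converges, every chained set of values shares one group number, although the numbers are not consecutive and would need renumbering, and the repeated self-join will be very slow on millions of rows.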

  • Would you accept a solution that involved normalizing your CSV column into another table? – cha0site Mar 30 '12 at 21:01
  • Storing CSV in a relational database is like towing your car using your bike. – Corbin Mar 30 '12 at 21:03
  • No, because the real table is too big (GBs), so I can't afford to normalize it. – Mar 30 '12 at 21:03
  • @Corbin But sometimes you have to work on tables created by others, and sometimes we can't even alter them. – Mar 30 '12 at 21:04
  • Some algorithm might just be slipping my mind, but I'm fairly certain that, best case, this will still be a rather computationally expensive operation to perform. As far as doing it in SQL goes, I must say I have no idea. What is the end goal for this data? Would it be possible to just process it in whatever programming language you're using? Also, if the table is many GBs, how large are the datasets you will be doing this operation on at a time? It's going to be VERY slow for large data sets. – Corbin Mar 30 '12 at 21:14
  • Sorry then, constructing a solution that meets your requirements will take more time than I would like to spend on it. I think this link on [how to split CSV values into separate values](http://www.slickdev.com/2008/09/15/mysql-query-real-values-from-delimiter-separated-string-ids/) will get you started. Good luck! – cha0site Mar 30 '12 at 21:21
  • Thanks @cha0site for listening. I have been working on it for 2 days, so I know how much time it needs. – Mar 30 '12 at 21:22
  • What is the range of the complete set of values that can be stored in col2? How many records are there in this table, and what is the maximum number of comma-separated values stored in col2? This operation will be very expensive whether normalised or not. – user1191247 Mar 30 '12 at 21:27
  • @nnichols It's like an assignment for me. I know it's too expensive, but I have to do it. The original table has millions of rows and the maximum number of comma-separated values is not fixed (col2 is varchar(100)). I am using MATCH ... AGAINST inside a procedure, but it assigns the group only once, not in a chain. – Mar 30 '12 at 21:44
  • @Corbin I am working on a large database, and this is a kind of assignment which, if I solve it, can be applied to the real table. – Mar 30 '12 at 21:50
  • I'm pretty sure you will spend more resources maintaining the CSV values than normalizing the data; plus, storing CSV will give worse performance than a normalized structure. – ajreal Mar 30 '12 at 21:51
  • @ajreal I can't do anything to the table structure. I'm just stuck with it and have to do this task on the same data :( – Mar 30 '12 at 21:54
  • @Ankit - you did not answer my questions. How many records are in the table? What is the range of values that can be stored in the comma-separated list? What is the maximum number of values stored in the comma-separated list? Is this a school assignment? You should post the SP that you have tried, as at least it will show that you have had a go. – user1191247 Mar 30 '12 at 22:09
  • @nnichols The original table has millions of records; the other details are given in previous comments. – Apr 02 '12 at 04:28
  • @nnichols The procedure I have achieved so far is given in the answer. – Apr 02 '12 at 05:18
  • Do you have the possibility to create another table, which would only provide redundancy to the information stored in this one, in a normalized way? You said you couldn't afford to make a new table from this one, but what if you could? Also, do you have real-time constraints on the col2/group mapping? – Hugo Briand Apr 02 '12 at 08:16
  • I can't normalize because it will give a big bunch of duplicate data, and the second thing is that the task is to assign the group number in the original table, not in a normalized table. – Apr 02 '12 at 08:18
  • If anything needs to be done first, it is to edit the question. I tried, but I just couldn't. – Sufendy Apr 02 '12 at 08:33
  • @Phelios What part of the question do you feel is wrong? – Apr 02 '12 at 08:36
  • @ankit Lack of punctuation, for one; there are no commas or full stops, which makes it so hard to read. Not to worry, my English is not perfect either; I will try again. But I really hope that someone better will edit it. – Sufendy Apr 02 '12 at 08:40
  • @Phelios Edits to the question are most welcome. I am bad at punctuation, verb forms, etc., so overlook that if you can; due to my social-network chatting habit I am also getting bad at spelling because of all the short forms. – Apr 02 '12 at 08:43
  • I do not understand the criteria for assigning group numbers 1 and 2. It seems like you're saying that if col2 contains 2 then assign group 1; otherwise 2. In that case the second row (5,6) should be in group 2, no? – Salman A Apr 02 '12 at 09:14
  • @Salman: I think he's trying to implement a sort of poor man's "Union-Find" data structure, where if one of the values in the CSV from two rows is the same, then those two rows have the same group ID. – cha0site Apr 02 '12 at 13:15
  • Is there a fulltext index on col2? Otherwise you can't use MATCH() ... AGAINST, as it is specific to fulltext indexes. – Andreas Wederbrand Apr 07 '12 at 16:36

1 Answer

It is possible to do, but I'm not sure how long it will take on your very big table. I assume that you are allowed to create a new table holding all the values and their group numbers while the group column is populated.

Also, this cannot run on a live table. That is not possible to achieve at all, so it's not a limitation of my design. Think about what would happen if you added a new row with the values 7 and '6,7': it would bridge groups 1 and 2, and all the work done so far would have to be dropped.
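
For example, with the test data further down, a hypothetical extra row would merge the two groups on the next run:

insert into company (col1, col2) values (7, '6,7');
call group_it();  -- 6 links to group 1 and 7 links to group 2,
                  -- so after the re-run every row ends up in one group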

This proc needs to be re-run every time there are additions to the table. If that is not acceptable, run it once and then replace it with triggers that maintain the values and merge groups when needed.

Here is the procedure. It could benefit from some modularization, but it works. I've taken Jay Pipes' split_string procedure and included it below.

First, the DDL and some test data:

CREATE TABLE `company` (
  `col1` int(11) DEFAULT NULL,
  `col2` varchar(100) DEFAULT NULL,
  `grp` int(11) DEFAULT NULL
);

CREATE TABLE `groups` (
  `number` int(11) NOT NULL DEFAULT '0',
  `grp` int(11) NOT NULL DEFAULT '0',
  `processed` tinyint(1) DEFAULT NULL,
  PRIMARY KEY (`number`,`grp`),
  KEY `grp` (`grp`)
);

insert into company (col1, col2) values 
(1,'2,3,4'),       
(2,'5,6'),        
(3,'1,2,5'),
(4,'7,8'),
(5,'11,3'),
(6,'22,8');

And now the procedure:

use test;

drop procedure if exists group_it;
delimiter //

create procedure group_it ()
begin                        
  declare current_group int default 0;
  declare ids varchar(100);

  -- clear out all data from before
  update company set grp = null;
  truncate groups;

  main: loop                                
    -- take one unmapped (new group)
    set ids := null;
    select col2 into ids from company where grp is null limit 1;
    if ids is null then
      leave main;
    end if;
    set current_group := current_group + 1;

    --  put each value into groups table and mark as unprocessed
    call split_string(ids, ',');
    insert into groups select value, current_group, false from SplitValues;

    -- while unprocessed value in groups
    begin
      declare unprocessed int;

      unprocessed: loop
        set unprocessed = null;
        select number
          into unprocessed
          from groups
         where not processed
         limit 1;

        if unprocessed is null then
          leave unprocessed;
        end if;

        begin
          -- find all rows in company that matches this group
          declare row_id int;
          declare ids2 varchar(100);

          declare cur2_done boolean;
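          -- the four regexp patterns match `unprocessed` as a whole list
          -- element: alone, at the start, at the end, or in the middle of col2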
          declare cur2 cursor for
            select col1, col2 
              from company
             where col2 regexp concat('^', unprocessed, '$')
                or col2 regexp concat('^', unprocessed, ',')
                or col2 regexp concat(',', unprocessed, '$')
                or col2 regexp concat(',', unprocessed, ',');

          declare continue handler for not found set cur2_done := true;

          open cur2;    
          numbers: loop
            set cur2_done := false;
            fetch cur2 into row_id, ids2; 
            if cur2_done then
                close cur2;
                leave numbers;
            end if;

            update company set grp = current_group where col1 = row_id;
            --  add all new values to groups marked as unprocessed
            call split_string(ids2, ',');   
            insert ignore into groups select value, current_group, false from SplitValues;
          end loop numbers;
          update groups set processed = true where number = unprocessed;
        end;
      end loop unprocessed;
    end;
  end loop main;
end//

delimiter ;         
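
For reference, running it on the test data above should reproduce the grouping from the question, assuming the unmapped rows happen to be picked in col1 order (otherwise groups 1 and 2 may swap numbers):

call group_it();
select col1, col2, grp from company order by col1;
-- 1  2,3,4  1
-- 2  5,6    1
-- 3  1,2,5  1
-- 4  7,8    2
-- 5  11,3   1
-- 6  22,8   2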

This is Jay Pipes' split_string:

DELIMITER //

DROP PROCEDURE IF EXISTS split_string //
CREATE PROCEDURE split_string (
  IN input TEXT,
  IN `delimiter` VARCHAR(10)
)
SQL SECURITY INVOKER
COMMENT
'Splits a supplied string using the given delimiter,
placing values in a temporary table'
BEGIN
  DECLARE cur_position INT DEFAULT 1;
  DECLARE remainder TEXT;
  DECLARE cur_string VARCHAR(1000);
  DECLARE delimiter_length TINYINT UNSIGNED;

  DROP TEMPORARY TABLE IF EXISTS SplitValues;
  CREATE TEMPORARY TABLE SplitValues (
    value VARCHAR(1000) NOT NULL PRIMARY KEY
  ) ENGINE=MyISAM;

  SET remainder = input;
  SET delimiter_length = CHAR_LENGTH(`delimiter`);

  WHILE CHAR_LENGTH(remainder) > 0 AND cur_position > 0 DO
    SET cur_position = INSTR(remainder, `delimiter`);
    IF cur_position = 0 THEN
      SET cur_string = remainder;
    ELSE
      SET cur_string = LEFT(remainder, cur_position - 1);
    END IF;
    IF TRIM(cur_string) != '' THEN
      INSERT INTO SplitValues VALUES (cur_string);
    END IF;
    SET remainder = SUBSTRING(remainder, cur_position + delimiter_length);
  END WHILE;

END //

DELIMITER ;
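
Called on its own it just fills the temporary SplitValues table, for example:

call split_string('2,3,4', ',');
select * from SplitValues;  -- returns the rows 2, 3 and 4
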
Andreas Wederbrand