
Can somebody guide me (ideally with a simple, fast query, or some fast code) to convert my comma-separated CSV data file:

1,A,C,Z,F,G
2,G,Q,R,C,
3,Z,G,Q,
4,C,F,
5,O,P,
6,O,X,Y,J,
7,A,P,X,

I have a table with ~1,000,000 records like the 7 shown above (in the real database A, B, C, ... are words in a string). Records 1 and 2 have the values G and C in common, and likewise pairs 2 & 3, 1 & 3, and so on.

I want to sync records that have at least two values in common, like records 1, 2, 3 and 4 (records 5, 6 and 7 don't share at least 2 values with any other record), and generate a list like this:

1 A C Z F G Q R
2 G Q R C A Z F
3 Z G Q A C F R
4 C F A Z G Q R
5 O P
6 O X Y J
7 A P X

At the end, if we sort the values, we must have 4 identical records, with the others left unsynced:

1 A C F G Q R Z
2 A C F G Q R Z
3 A C F G Q R Z
4 A C F G Q R Z
5 O P
6 J O X Y
7 A P X

Maybe I'm not using the right term for what I mean, so here is an example:

1 A C Z F G 
2 G Q R C

Record 1 has C and G in common with record 2. Record 1 lacks R and Q, so it becomes 1 A C Z F G + Q and R; record 2 lacks A, Z and F, so it becomes 2 G Q R C + A, Z and F. At the end we have:

1 A C Z F G Q R 
2 G Q R C A Z F

I need all records processed in order, from top to bottom. I wrote some Delphi code, but it is very slow. Someone suggested this Groovy code to me:

def f = [:]
new File('Data.csv').readLines().each {
    def items = it.split(',')
    def name
    items.eachWithIndex { String entry, int i ->
        if (i == 0) {
            name = entry            // first column is the record id
        } else if (entry) {         // skip empty trailing fields
            if (!f[entry])
                f[entry] = []
            f[entry] << name        // map each value to the ids containing it
        }
    }
}
f.findAll { it.value.size() > 1 }   // keep only values shared by 2+ records

It is very fast (because it uses a map, I think), but it only finds the common values.
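
To go from finding the common values to actually merging the records, here is a rough sketch building on the same map idea (untested on the real data, and it merges pairwise in a single pass, so for example record 4 would only gain record 1's words):

def rows = [:]                                  // id -> set of values
new File('Data.csv').readLines().each { line ->
    def items = line.split(',')                 // trailing empty fields are dropped
    rows[items[0]] = items.drop(1).findAll { it } as Set
}

def valueToIds = [:]                            // value -> ids containing it
rows.each { id, vals ->
    vals.each { v -> valueToIds.get(v, []) << id }
}

def pairCounts = [:]                            // [idA, idB] -> shared value count
valueToIds.values().each { ids ->
    ids.eachWithIndex { a, i ->
        ids.drop(i + 1).each { b ->
            def key = [a, b]
            pairCounts[key] = (pairCounts[key] ?: 0) + 1
        }
    }
}

def merged = rows.collectEntries { id, vals -> [id, new TreeSet(vals)] }
pairCounts.findAll { it.value >= 2 }.each { pair, cnt ->
    def (a, b) = pair
    merged[a].addAll(rows[b])                   // union both directions
    merged[b].addAll(rows[a])
}
merged.each { id, vals -> println "$id ${vals.join(' ')}" }

Repeating the merge loop until nothing changes would make it transitive, if that is what the expected output requires.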

SmmSamm
    Pure code-writing requests are off-topic on Stack Overflow — we expect questions here to relate to *specific* programming problems — but we will happily help you write it yourself! Tell us [what you've tried](https://stackoverflow.com/help/how-to-ask), and where you are stuck. This will also help us answer your question better. – Jack Oct 31 '18 at 14:39
  • Wow. This one looks like a hard nut to crack on any DBMS. Not so much the sorting of the characters, but to match & combine the records based on 2 common characters could be a PITA. I can't even imagine what the reason would be for such requirement. – LukStorms Oct 31 '18 at 15:07
  • How many different letters (words) are there in your alphabet (lexicon)? – Rick James Nov 01 '18 at 06:06

1 Answer


If you would go for a SQL solution, then that csv data could be put in a normalized table with the data unfolded per ID & WORD.

Once you have that, it becomes a matter of self-joining that table and concatenating the words back together in alphabetical order.

SqlFiddle test here

Not sure how fast this method would be on a table with 1000k records though.
But it's an interesting puzzle.

Sample data:

DROP TABLE IF EXISTS test_words;
CREATE TABLE IF NOT EXISTS test_words (
  id int unsigned NOT NULL PRIMARY KEY,
  words varchar(60) NOT NULL
);
INSERT INTO test_words (id, words) VALUES
(1,'A C Z F G'),
(2,'G Q R C'),
(3,'Z G Q'),
(4,'C F'),
(5,'P O'),
(6,'O X Y J'),
(7,'A P X');

Tally table with numbers:

DROP TABLE IF EXISTS tmp_numbers;
CREATE TEMPORARY TABLE IF NOT EXISTS tmp_numbers (
  n int unsigned NOT NULL PRIMARY KEY
);
INSERT INTO tmp_numbers (n) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
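
Note that this tally only supports up to 10 words per row. If the real data can contain more words per row (an assumption; the sample doesn't need it), the tally has to be extended. Since MySQL doesn't allow a TEMPORARY table to be referenced twice in the same query, the extra numbers are easiest to add literally, for example:

-- Hypothetical extension for rows with up to 20 words.
-- (MySQL can't self-join a TEMPORARY table to generate these.)
INSERT INTO tmp_numbers (n) VALUES
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20);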

Unfolding the words:

DROP TABLE IF EXISTS test_words_unfolded;
CREATE TABLE test_words_unfolded (
  word varchar(10) NOT NULL,
  id int unsigned NOT NULL,
  PRIMARY KEY (word, id)
);
INSERT INTO test_words_unfolded (word, id)
-- Nested SUBSTRING_INDEX extracts the nr.n'th space-separated word;
-- the join condition caps nr.n at the number of words in the string.
SELECT DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(t.words,' ', nr.n),' ',-1) as word, t.id
FROM test_words AS t
JOIN tmp_numbers AS nr
  ON CHAR_LENGTH(t.words) - CHAR_LENGTH(REPLACE(t.words,' ','')) >= nr.n - 1
 AND SUBSTRING_INDEX(SUBSTRING_INDEX(t.words,' ', nr.n),' ',-1) != '';
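
For the sample data, test_words_unfolded then contains one row per (word, id) combination, e.g. (abbreviated):

word  id
----  --
A     1
A     7
C     1
C     2
C     4
F     1
F     4
G     1
G     2
G     3
...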

Result table:

DROP TABLE IF EXISTS test_result;
CREATE TABLE IF NOT EXISTS test_result (
  id int unsigned NOT NULL PRIMARY KEY,
  words varchar(60) NOT NULL
);

INSERT INTO test_result (id, words) 
SELECT q.id, GROUP_CONCAT(DISTINCT t3.word ORDER BY t3.word ASC SEPARATOR ' ') as words
FROM 
(
    -- Pairs of id's that share at least 2 words.
    -- "OR t1.id = t2.id" pairs every id with itself, so records
    -- without any 2-word match still show up in the result.
    SELECT t1.id, t2.id as id2
    FROM test_words_unfolded t1
    JOIN test_words_unfolded t2 ON t1.word = t2.word
    GROUP BY t1.id, t2.id
    HAVING COUNT(*) > 1 OR t1.id = t2.id
) q
LEFT JOIN test_words_unfolded t3 ON t3.id = q.id2
GROUP BY q.id
ORDER BY q.id;

SELECT *
FROM test_result
ORDER BY id;

Result (note record 4: it shares two words only with record 1, so it only receives record 1's words; the matching is pairwise, not transitive):

id  words
--  -----
1   A C F G Q R Z
2   A C F G Q R Z
3   A C F G Q R Z
4   A C F G Z
5   O P
6   J O X Y
7   A P X

Extra

To mark the words that have been added, the query to fill the result table becomes a bit more complicated.

SELECT 
 q2.id, 
 GROUP_CONCAT(DISTINCT CASE WHEN q2.ori = 1 THEN q2.word ELSE CONCAT('[',q2.word,']') END ORDER BY q2.word ASC SEPARATOR ' ') as words
FROM
(
  SELECT
   q1.id, t3.word,
   -- ori = 1 when the word already occurs in the original record q1.id
   MAX(CASE WHEN q1.id = t3.id THEN 1 ELSE 0 END) as ori
  FROM 
  (
      SELECT
       t1.id, t2.id as id2
      FROM test_words_unfolded t1
      JOIN test_words_unfolded t2 ON t1.word = t2.word
      GROUP BY t1.id, t2.id
      HAVING COUNT(*) > 1 OR t1.id = t2.id
  ) q1
  LEFT JOIN test_words_unfolded t3 ON t3.id = q1.id2
  GROUP BY q1.id, t3.word
) q2
GROUP BY q2.id
ORDER BY q2.id;

Result:

id  words
--  -----
1   A C F G [Q] [R] Z
2   [A] C [F] G Q R [Z]
3   [A] [C] [F] G Q [R] Z
4   [A] C F [G] [Z]
5   O P
6   J O X Y
7   A P X

Additional experiment here

LukStorms
  • Thank you for your time and creative mind. I want to test it on my main database, but what about if I have this record: 10 A A B, or this one: 10 this is a test? I mean, what if I have words instead of chars in the fields? – SmmSamm Nov 04 '18 at 22:09
  • @SmmSamm It's an example, so it's totally up to you to change the datatypes and sizes for your personal needs. But I've just updated and made them bigger. – LukStorms Nov 04 '18 at 23:16
  • Thank you again and again. Now, how can I prevent it from adding duplicate values? For example, if I have 10 A A X (a duplicate value within one record), I get this error: Duplicate entry for key 'PRIMARY'. But I don't need duplicate values, even when they are inside records. Is it possible to tell it not to add duplicates at build time? – SmmSamm Nov 05 '18 at 00:36
  • @SmmSamm Ah right, if you have dups in the source strings then it would give that error because of the composite primary key on test_words_unfolded. Which enforces uniqueness on (word, id). I added that to increase performance on the query with the self-joins. But you could add a `DISTINCT` to the select for "test_words_unfolded". – LukStorms Nov 05 '18 at 08:31
  • Thank you, I did it. Now 2 questions: 1. How can I force it to add a marker to the added strings so I can recognize them afterwards? For example, at first I had 1,A,C,Z,F,G and now I want to have 1 A C F G [Q] [R] Z, where the two [] let me recognize the added strings. 2. Is it possible to read directly from my CSV? – SmmSamm Nov 05 '18 at 17:25
  • @SmmSamm I've added an extra query. But the extra sub-query will make it slower. About 2, I don't know anything about selecting directly from a file in MySql. – LukStorms Nov 05 '18 at 20:52
  • OK, I tested it (on a small CSV, after changing it to your script's format) and your solution could help me. Now, what if I want to know the source of the fetched values? For example 1 A C F G [2]Q [2]R Z, where [2] means the data is from record 2. – SmmSamm Nov 06 '18 at 15:08
  • @SmmSamm What if you add a GROUP_CONCAT on `t3.id` in the `q2` sub-query. Then you could use that field in the `CONCAT` used in the `CASE`. – LukStorms Nov 06 '18 at 15:30
  • Where in the max()? – SmmSamm Nov 06 '18 at 16:29
  • After the max. F.e. `GROUP_CONCAT(t3.id SEPARATOR ',') as idlist` – LukStorms Nov 06 '18 at 16:41
  • Dear @LukStorms, I think my text values are too long for WORDS; it showed me an error at first. So I set words varchar(20000), then searched and changed STRICT_TRANS_TABLES in my.ini, but now I have only 1412 records in my result (the result is fantastic, but not all my records are there). The script gave me a warning: INSERT INTO test_words_unfolded (word, id) SELECT DISTINCT SUBSTRING_INDEX(SUBSTRING_... 13212 row(s) affected, 1 warning(s): 1265 Data truncated for column 'word' at row 15854 Records: 13212 Duplicates: 0 Warnings: 1 7.192 sec. Can you help me again? Thank you – SmmSamm Nov 07 '18 at 15:10
  • And an error: please check these: (1,'A'), (2,'G;Q;R;C'), (3,'Z;G;Q'), (4,'C;F'), (5,'P;O'), (6,'O;X;Y;J'), (7,'A;P;X'), (8,'G'), (9,'C;Z;F;G'). If we have some records with only one value, the script does not put them in the result. I put these in and I have only 8 records: 2 ;C;[F];G;Q;R;[Z] 3 ;[C];[F];G;Q;[R];Z 4 ;C;F;[G];[Z] 5 ;O;P 6 ;J;O;X;Y 7 ;A;P;X 9 ;C;F;G;[Q];[R];Z. Records 1 & 8 are not in the result. – SmmSamm Nov 07 '18 at 15:27
  • In some records, why do some values only have [ or ]? For example it generated something like this: 1000 [A;B;Q;C;[W];Z,Y];I;L – SmmSamm Nov 07 '18 at 15:41
  • About the truncate error. Probably because you have words bigger than the size of the word varchar. And yeah, because of the HAVING rule it would ignore those with only 1 word. Unless you add `OR t1.id = t2.id` to that HAVING clause? – LukStorms Nov 07 '18 at 15:45
  • Dear @LukStorms, now my data (Unicode, with very long text as column values) is too big for the 3 varchar(s) you gave me in this script. I changed them to text, but it shows me an error because of the logic of the code. Can you help me again? – SmmSamm Jan 02 '19 at 14:17
  • @SmmSamm Well, I don't want to spend any more free time on this one. But I suppose there's a limit to GROUP_CONCAT. I think you can change that for the session, but have never needed it. Check out [this old SO post](https://stackoverflow.com/questions/2567000/) If that doesn't help then perhaps post a new question for your issue. – LukStorms Jan 02 '19 at 14:32
  • You are the master, @LukStorms, that was right. But another problem: with 1000/5000/10000 records it works, but when I use it on 100000 records, Workbench stops and shows me: Lost connection to MySQL server during query 300.505 sec. I changed Edit -> Preferences -> SQL Editor -> MySQL Session -> keep-alive to 1000000, but then it works and works and never stops. – SmmSamm Jan 05 '19 at 05:27
  • @SmmSamm I don't know much about solving timeouts or connection problems. But maybe you could split up the work done by that huge query for the results. With so many records, joining on text fields can become a bottleneck. I added an experiment at the bottom of my answer. It first loads the unique words into a table, then the matching is done on the id's. More storage is needed that way. – LukStorms Jan 05 '19 at 21:00
  • Please give me an email; it shows a connection error. I think I must send some real data to you. – SmmSamm Jan 06 '19 at 06:12
  • @SmmSamm No can do. I updated that experiment again [here](https://rextester.com/ZKT53386). The idea of splitting it up is that you run the long-running queries separately, not everything at once. I can't help you with optimizations, but perhaps you could post the timeout problem you experience at [CodeReview](https://codereview.stackexchange.com/) or [DBA Administrators](https://dba.stackexchange.com/)? (not sure which Stack Exchange site is best for that kind of thing) – LukStorms Jan 06 '19 at 15:00
  • I assume the DBA Administrators site, because I noticed that they transferred [some question regarding optimization](https://dba.stackexchange.com/questions/209854/) to that site from SO for some reason. – LukStorms Jan 06 '19 at 15:07
  • Please see this link: https://dba.stackexchange.com/questions/226472/how-can-i-optimize-this-mysql-script-for-many-records – SmmSamm Jan 10 '19 at 09:56