0

I have a table in mysql as mentioned below :

create table test (
id int,
page_id varchar(2),
name varchar(255));

insert into test values 
(1,'P1','a,b,c,d,e'),
(1,'P2','f,h,z,a,c'),
(1,'P3','x,yz,g,c'),
(2,'P1','a,z'),
(2,'P2','a,c,v,b');

I need to count the occurrence of the values available in different pages. For e.g.

  • count of "a" is 2 for id 1 because it is available in page P1 and P2.
  • count of "b" is 1 for id 1 because it is only available in
    page P1.
  • count of "c" is 3 for id 1 because it is only available in page P1,P2 and P3

The number of pages are approximately 2,00,000 for id 1 and 56,000 for id 2

Can anyone help me how to count the occurrence of "name" in id for the page.

Ergest Basha
  • 7,870
  • 4
  • 8
  • 28
AGaur
  • 185
  • 1
  • 1
  • 15
  • how the `python` tag is related to this question? @AnkurSharma – lemon Jun 04 '22 at 16:42
  • @lemon i have the option to use the python as well to insert the aggregated data into mysql via python – AGaur Jun 04 '22 at 16:54
  • Drakax this is the data :Id Page_id name 1 P1 a,b,c,d,e 1 P2 f,h,z,a,c 1 P3 x,yz,g,c 2 P1 a,z 2 P2 a,c,v,b – AGaur Jun 04 '22 at 16:55

1 Answers1

1

First suggestion from me is fixing database design and never store comma separated values.

One way of doing this is first you need to transform the name column into rows. To do so, create a number table with the maximum character length of name column.

create table numbers (
nr int );

insert into numbers values (1),(2), (3),(4), (5),(6);

Your data example:

create table test (
id int,
page_id varchar(2),
name varchar(255));

insert into test values 
(1,'P1','a,b,c,d,e'),
(1,'P2','f,h,z,a,c'),
(1,'P3','x,y,z,g,c'),
(2,'P1','a,z'),
(2,'P2','a,c,v,b');

Below query will return name column into rows

select id,
       page_id,
       SUBSTRING_INDEX(SUBSTRING_INDEX(test.name, ',', numbers.nr), ',', -1) as name
from numbers 
inner join test on CHAR_LENGTH(test.name)-CHAR_LENGTH(REPLACE(test.name, ',', ''))>=numbers.nr-1
group by id,page_id,SUBSTRING_INDEX(SUBSTRING_INDEX(test.name, ',', numbers.nr), ',', -1)
order by id asc;

Finally for your expected result, apply an outer query counting:

select id,name,count(distinct page_id,name) as page_count
from (
       select id,
       page_id,
       SUBSTRING_INDEX(SUBSTRING_INDEX(test.name, ',', numbers.nr), ',', -1) as name
from numbers 
inner join test on CHAR_LENGTH(test.name)-CHAR_LENGTH(REPLACE(test.name, ',', ''))>=numbers.nr-1
group by id,page_id,SUBSTRING_INDEX(SUBSTRING_INDEX(test.name, ',', numbers.nr), ',', -1)
     ) as t1
group by id,name
order by id asc;

https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=a5dc594b23eacc3298ca2c5a103e0fae

Note. For count(distinct page_id,name) both columns should have the same data types

Refrences:

SQL split values to multiple rows

Count distinct value pairs in multiple columns in SQL

Ergest Basha
  • 7,870
  • 4
  • 8
  • 28
  • 1
    Nice fiddle! Yes, this is exactly the right approach, as Hadley describes in https://r4ds.had.co.nz/tidy-data.html : "Each observation must have its own row." – J_H Jun 04 '22 at 20:46