1

Here we are comparing entries in table

CREATE TABLE a
(id INT PRIMARY KEY,
p1 INT, p2 INT, p3 INT, .. , p15 INT)

p(n) takes the value from 0 to 2

I have to get all entries with unique combinations of parameteres. It's not difficult task, so I created a table like this

CREATE TEMPORARY TABLE b AS
(SELECT
        t1.id, 
        t2.p1, t2.p2, t2.p3, t2.p4, t2.p5, t2.p6, t2.p7, t2.p8,
        t2.p9, t2.p10, t2.p11, t2.p12, t2.p13, t2.p14, t2.p15
FROM
(
    SELECT 
        p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15
    FROM
        a
    GROUP BY
        p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15
    HAVING COUNT(*) = 1
)t2
LEFT JOIN a t1 on 
t2.p1 = t1.p1
AND t2.p2 = t1.p2
AND t2.p3 = t1.p3
AND t2.p4 = t1.p4
AND t2.p5 = t1.p5
AND t2.p6 = t1.p6
AND t2.p7 = t1.p7
AND t2.p8 = t1.p8
AND t2.p9 = t1.p9
AND t2.p10 = t1.p10
AND t2.p11 = t1.p11
AND t2.p12 = t1.p12
AND t2.p13 = t1.p13
AND t2.p14 = t1.p14
AND t2.p15 = t1.p15)

Here we are get enries with unique combination of parameters.

Next step is for each record in table A to find all records from the table B that differ by one, two and three parameters. Records differ by a single parameter should be no more than one, records for two different parametres no more than two etc.

For example:

 id  |    p(n)          
-----+----------------    
 1   |000000000000000       
 2   |000000000000001  

I created a temporary table of the form

CREATE TEMPORARY TABLE c AS
(
SELECT
    cnt, id1, id2
FROM
(
    SELECT 
        (t1.p1 = t2.p1)+(t1.p2 = t2.p2)
        +(t1.p3 = t2.p3) +(t1.p4 = t2.p4) +(t1.p5 = t2.p5)
        +(t1.p6 = t2.p6) +(t1.p7 = t2.p7) +(t1.p8 = t2.p8)
        +(t1.p9 = t2.p9) +(t1.p10 = t2.p10) +(t1.p11 = t2.p11)
        +(t1.p12 = t2.p12) +(t1.p13 = t2.p13) +(t1.p14 = t2.p14)
        +(t1.p15 = t2.p15) AS cnt,
        t1.id id1,
        t2.id id2
    FROM 
        b AS t1,
        a AS t2
)
WHERE
    (cnt BETWEEN 12 AND 14)
    AND (id1 < id2)
)

Here I get a table with pairs that differ by 1, 2 and 3 parameters

But I ran into a problem entries in the table quite a lot about 100,000 entries. This table is too large (processed data on a home PC) and the creation of the table gives a very long one.

Perhaps this is the only way to get everything, but can anyone have any idea of the analytical method of solving this problem than brute force couples (maybe not SQL). Of course, that will be solved much faster...

Any hint will be appreciated! Thank you!

Roman Pekar
  • 107,110
  • 28
  • 195
  • 197

2 Answers2

1

If you want a table with only unique entries, you could create a second table with all columns as part of a composite primary key:

CREATE TABLE b (
(id INT, p1 INT, p2 INT, p3 INT, .. , p15 INT)
PRIMARY KEY (p1, p2, p3, .. , p15))
IGNORE SELECT * FROM a;
Richard Wеrеzaк
  • 1,551
  • 1
  • 13
  • 17
  • Not quite what I need... I have to get rid of all the entries that have the same combination of parameters and this way only removes duplicates – Alexander Khe Aug 19 '13 at 07:22
0

May be it's not a full answer on your question, but if I've had such a task, first thing I would try is to generalize a query. It's very hard for me when I have to specify more than 3 similar columns and it's very error prone.
So, I suggest to try to pivot you columns into rows and compare differences, like ( choose whatever pivot method you like, I've just used union for sqlfiddle, you can use hstore as posted here PostgreSQL columns to rows with no explicilty specifying column names / columns):

with cte1 as (
    select id, 'p1' as name, p1 as value from a
    union all
    select id, 'p2' as name, p2 as value from a
    union all
    select id, 'p3' as name, p3 as value from a
    union all
    select id, 'p4' as name, p4 as value from a
), cte2 as (
    select
        c1.id, sum(case when c1.value = c2.value then 0 else 1 end) as diff
    from cte1 as c1
         inner join cte1 as c2 on c2.id <> c1.id and c2.name = c1.name
    group by c1.id, c2.id
)
select
    id, diff, count(*) as cnt
from cte2
group by id, diff
order by id, diff

I'm supposing that your table have no duplicates, you can eliminate them beforehand.

sql fiddle demo

update
I don't know if it'll help you, bu take a look at this question PostgreSQL, find strings differ by n characters, I've asked it to try to help you, check Erwin Brandstetter answer there.

I've created an sql fiddle demo with different methods for you, looks like using levenshtein is fastest one, but it's not much faster than your original method.

Community
  • 1
  • 1
Roman Pekar
  • 107,110
  • 28
  • 195
  • 197