3

I have a large table and I need to check for similar rows. I don't need all column values be the same, just similar. The rows must not be "distant" (determined by a query over other table), no value may be too different (I have already done the queries for these conditions) and most other values must be the same. I must expect some ambiguity, so one or two different values shouldn't break the "similarity" (well, I could get better performance by accepting only "completely equal" rows, but this simplification could cause errors; I will do this as an option).

The way I am going to solve this is through PL/pgSQL: to make a FOR LOOP iterating through the results of previous queries. For each column, I have an IF testing whether it differs; if yes, I increment a difference counter and go on. At the end of each loop, I compare the value to a threshold and see if I should keep the row as "similar" or not.

Such a PL/pgSQL-heavy approach seems slow in comparison to a pure SQL query, or to an SQL query with some PL/pgSQL functions involved. It would be easy to test for rows with all but X equivalent rows if I knew which rows should be different, but the difference can occur at any of some 40 rows. Is there any way how to solve this by a single query? If not, is there any faster way than to examine all the rows?

EDIT: I mentioned a table, in fact it is a group of six tables linked by 1:1 relationship. I don't feel like explaining what is what, that's a different question. Extrapolating from doing this over one table to my situation is easy for me. So I simplified it (but not oversimplified it - it should demonstrate all the difficulties I have there) and made an example demonstrating what I need. Null and anything else should count as "different". No need to make a script testing it all - I just need to find out whether it is possible to do in any way more efficient than I thought about.

The point is that I don't need to count rows (as usual), but columns.

EDIT2: previous fiddle - this wasn't so short, so I let it here just for archiving reasons.

EDIT3: simplified example here - just NOT NULL integers, preprocessing omitted. Current state of data:

select * from foo;
     id | bar1 | bar2 | bar3 | bar4 | bar5 
    ----+------+------+------+------+------
      1 |    4 |    2 |    3 |    4 |   11 
      2 |    4 |    2 |    4 |    3 |   11 
      3 |    6 |    3 |    3 |    5 |   13 

When I run select similar_records( 1 );, I should get only row 2 (2 columns with different values; this is within limit), not 3 (4 different values - outside the limit of two differences at most).

Community
  • 1
  • 1
Pavel V.
  • 2,653
  • 10
  • 43
  • 74
  • 3
    Please post the definition of the table(s) (as `create table`) some sample data and the expected output. Sounds as if you might be looking for something like `sum((col_1 <> col_2)::int)` –  Aug 18 '14 at 08:27
  • paste table definition, paste You query/Queries – Andrzej Reduta Aug 18 '14 at 08:33
  • @a_horse_with_no_name: to be honest, it is not a single table, but six tables connected with 1:1 relationships, and the SQL queries are quite big, definitely too big to insert into a SO question. I thought for your SUM suggestion for a while and I don't think it's the way (but maybe I just misunderstand something). Anyway, I'll try to edit my question soon, with some link to a SQLFiddle or something like that. – Pavel V. Aug 18 '14 at 09:21
  • Another edit. The first fiddle wasn't clear; I hope it is clear now. – Pavel V. Aug 18 '14 at 13:57
  • Do you know the acronym SSCCE? [Short, Self Contained, Correct (Compilable), Example](http://sscce.org/). I am sure your real life task is complex and you have to consider many things. But that is hardly relevant for your **question** here. Reduce your example to the thing you actually want to ask and remove all other distractions. – Erwin Brandstetter Aug 18 '14 at 13:58

2 Answers2

1

In instead of a loop to compare each row to all the others do a self join

select f0.id, f1.id
from foo f0 inner join foo f1 on f0.id < f1.id
where
    f0.bar1 = f1.bar1 and f0.bar2 = f1.bar2
    and
    @(f0.bar3 - f1.bar3) <= 1 
    and
    f0.bar4 = f1.bar4 and f0.bar5 = f1.bar5
    or
    f0.bar4 = f1.bar5 and f0.bar5 = f1.bar4
    and
    @(f0.bar6 - f1.bar6) <= 2
    and
    f0.bar7 is not null and f1.bar7 is not null and @(f0.bar7 - f1.bar7) <= 5 
    or
    f0.bar7 is null and f1.bar7 <= 3
    or
    f1.bar7 is null and f0.bar7 <= 3
    and
    f0.bar8 = f1.bar8
    and
    @(f0.bar11 - f1.bar11) <= 5
;
 id | id 
----+----
  1 |  4
  1 |  5
  4 |  5
(3 rows)

select * from foo;
 id | bar1 | bar2 | bar3 | bar4 | bar5 | bar6 | bar7 | bar8 | bar9 | bar10 | bar11 
----+------+------+------+------+------+------+------+------+------+-------+-------
  1 | abc  |    4 |    2 |    3 |    4 |   11 |    7 | t    | t    | f     |  42.1
  2 | abc  |    5 |    1 |    6 |    2 |    8 |   39 | t    | t    | t     |  19.6
  3 | xyz  |    4 |    2 |    3 |    5 |   14 |   82 | t    | f    |       |    95
  4 | abc  |    4 |    2 |    4 |    3 |   11 |    7 | t    | t    | f     |  42.1
  5 | abc  |    4 |    2 |    3 |    4 |   13 |    6 | t    | t    |       |  37.7

Are you aware that the and operator has priority over or? I'm asking because it looks like the where clause in your function is not what you want. I mean in your expression it is enough to f0.bar7 is null and f1.bar7 <= 3 to be true to include the pair

Clodoaldo Neto
  • 118,695
  • 26
  • 233
  • 260
  • So my fiddle was misleading :-( What you reproduce is just my "preprocessing" - what I was asking about was how to count rows with slightly different values (within the bounds given by this filter). I'll finish the fiddle and insert the part whose substitute I'm asking for. – Pavel V. Aug 18 '14 at 13:20
1

To find rows that only differ on a given maximum number of columns:

WITH cte AS (
   SELECT id
         ,unnest(ARRAY['bar1', 'bar2', 'bar3', 'bar4', 'bar5']) AS col  -- more
         ,unnest(ARRAY[bar1::text, bar2::text, bar3::text
                     , bar4::text, bar5::text]) AS val -- more
   FROM   foo
   )
SELECT b.id, count(a.val <> b.val OR NULL) AS cols_different
FROM   (SELECT * FROM cte WHERE id =  1) a
JOIN   (SELECT * FROM cte WHERE id <> 1) b USING (col)
GROUP  BY b.id
HAVING count(a.val <> b.val OR NULL) < 3 -- max. diffs allowed
ORDER  BY 2;

I ignored all the other distracting details in your question.

Demonstrating with 5 columns. Add more as required.

If columns can be NULL you may want to use IS DISTINCT FROM instead of <>.

This is using the somewhat unorthodox, but handy parallel unnest(). Both arrays must have the same number of elements to work. Details:

SQL Fiddle (building on yours).

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 1
    Wow, I half expected that hours spent on asking and improving this question will result just in "this is not possible with SQL", and you not only understood what I was asking for, but proved that it is possible. +1 and accepted :-) What's more, big 'thank you' for making me try to learn SSCCE in practice! – Pavel V. Aug 18 '14 at 14:55