I have a large table and I need to check for similar rows. I don't need all column values be the same, just similar. The rows must not be "distant" (determined by a query over other table), no value may be too different (I have already done the queries for these conditions) and most other values must be the same. I must expect some ambiguity, so one or two different values shouldn't break the "similarity" (well, I could get better performance by accepting only "completely equal" rows, but this simplification could cause errors; I will do this as an option).
The way I am going to solve this is through PL/pgSQL: to make a FOR LOOP iterating through the results of previous queries. For each column, I have an IF testing whether it differs; if yes, I increment a difference counter and go on. At the end of each loop, I compare the value to a threshold and see if I should keep the row as "similar" or not.
Such a PL/pgSQL-heavy approach seems slow in comparison to a pure SQL query, or to an SQL query with some PL/pgSQL functions involved. It would be easy to test for rows with all but X equivalent rows if I knew which rows should be different, but the difference can occur at any of some 40 rows. Is there any way how to solve this by a single query? If not, is there any faster way than to examine all the rows?
EDIT: I mentioned a table, in fact it is a group of six tables linked by 1:1 relationship. I don't feel like explaining what is what, that's a different question. Extrapolating from doing this over one table to my situation is easy for me. So I simplified it (but not oversimplified it - it should demonstrate all the difficulties I have there) and made an example demonstrating what I need. Null and anything else should count as "different". No need to make a script testing it all - I just need to find out whether it is possible to do in any way more efficient than I thought about.
The point is that I don't need to count rows (as usual), but columns.
EDIT2: previous fiddle - this wasn't so short, so I let it here just for archiving reasons.
EDIT3: simplified example here - just NOT NULL integers, preprocessing omitted. Current state of data:
select * from foo;
id | bar1 | bar2 | bar3 | bar4 | bar5
----+------+------+------+------+------
1 | 4 | 2 | 3 | 4 | 11
2 | 4 | 2 | 4 | 3 | 11
3 | 6 | 3 | 3 | 5 | 13
When I run select similar_records( 1 );
, I should get only row 2 (2 columns with different values; this is within limit), not 3 (4 different values - outside the limit of two differences at most).