0

I have this table:

itemA itemB relation

1     2     0.4
1     3     0.8
2     1     0.4
2     3     0.6
3     1     0.8
3     2     0.6

It's 10m rows table and counting, and as you see, there's lots of duplicate information. I use this table like this:

SELECT itemB WHERE itemA=1 ORDER BY relation DESC LIMIT 1
// in this case - 3

Only in real life my limit is not 1 but 500, so I get the 500 items that have the highest 'relation' value with the item I query for.

I'm considering narrowing down the table and remove duplicates:

itemA itemB relation

1     2     0.4
1     3     0.8
2     3     0.6

this will reduce table size by 50%. but then I need a query that goes like this -

"look for item x in itemA AND itemB, and when you find it, get its pair. when you have all pairs, sort them by their relation with item x, and select the top 500."

A. Do you think this will actually make the table usage more efficient?

B. How should this query look like?

Dharman
  • 30,962
  • 25
  • 85
  • 135
Roy
  • 1,307
  • 1
  • 13
  • 29
  • What do you mean get "it's pair"? Are you talking about a matching entry? Can't you store a count column for duplicates? Otherwise you might want to [index](http://stackoverflow.com/questions/1108/how-does-database-indexing-work) your database to improve speed. – benedict_w Apr 02 '12 at 08:02

1 Answers1

2

A. The best way to determine this is to set up a couple of test tables and benchmark some potential queries.

B. There are a number of ways you could do it. One might be

SELECT DISTINCT LEAST(itemA,itemB),GREATEST(itemA,itemB),relation 
FROM `table`
WHERE itemA = 1
   OR itemB = 1
ORDER BY relation DESC
LIMIT 500

Or possibly

SELECT otherItem,relation FROM
(
    SELECT itemB as otherItem,relation
    FROM `table`
    WHERE itemA = 1
) UNION DISTINCT (
    SELECT itemA as otherItem,relation
    FROM `table`
    WHERE itemB = 1
) as combined_items
ORDER BY relation DESC
LIMIT 500
liquorvicar
  • 6,081
  • 1
  • 16
  • 21