0

I have a online shop application and a database of around 1000 ITEMS.

ITEM{
  categories / up to 5  out of 60
  types / up to 2 out of 10
  styles / up to 2 out of 10
  rating / 0-5
}

Now I wont to create a comparison item-to-item with predefined conditions:
- At least one common category += 25points
- At least one common type += 25p.
- If first item has no styles += 0p.
- If no styles in common -= 10p. - For each point in rating difference -= 5p.

And store the result in a table. as item_to_item_similarity.score.

Now I made the whole thing with a nice and shiny PHP functions and classes ..
And a function to calculate and update all the relations.

In the test withs 20 items .. all went well. But when increased the test data to 1000 items .. resulting in 1000x1000 relations The server started complaining about script_time_out .. and out of memory :)

Indexes, transaction and pre-loading some of the data .. helped me half the way.

Is there a smarter way to compare and evaluate this type of data?

I was thinking to represent the related categories, styles etc.
as a set of IDs, possibly in some binary mask .. so that they can be easily compared (even in the SQL ?) with out the need to create classes, and loops trough arrays millions of times.

d.raev
  • 9,216
  • 8
  • 58
  • 79

1 Answers1

1

I know this isn't the best but, what about the following:

  1. You have your table which links the two items, a timestamp, and has their score. This table will hold the 1,000,00 records.
  2. You have a CRON script, which runs every 15 mins.
  3. First time cron runs, it creates the 1,000,000 rows. No scores are calculated. This can be done by counting rows in table. If count==0 then it's first run
  4. Second run and thereafter runs, it selects 1000 records, and calculates their score and updates the timestamp. It should select 1000 records ordered by the timestamp, so that it selects 1000 oldest records.
  5. Leave this to run in the background, every 15 mins or so. Will take like 10 days to run in total and calculate all the scores.
  6. Whenever you update a product, you need to reset the date on the linking table, so that when the cron runs it recalculates the score for all rows that mention that item.
  7. When you create a new product, you must create the linking rows, so it has to add a row for each other item

Personally, I'd consider using a different method altogether, there are plenty of algorithms out there you just have to find one which applies to this scenario. Here is one example:

Also, here is the Jaccard Index written in PHP which may be more efficient that your current method

Community
  • 1
  • 1
Jamie Bicknell
  • 2,306
  • 17
  • 35