
I have a product table that is updated using CSV feeds from various suppliers. Each feed has its own table; however, products can appear multiple times in the same supplier table, and in multiple supplier tables. Each product can only occur once in our main table, though. I don't anticipate ever using more than about ten different supplier tables. Tables are updated at least daily and at most once every 6-8 hours, and read speeds are a much higher priority than write speeds. There are usually about 500,000 enabled products at any given time.

My first plan was to store the table name and primary key ID in that table for each product, then recalculate it during each update, but according to the responses here, having to do that is an indication that the database isn't designed correctly.

Using a view to combine these tables into a single virtual table seems like it'd help a lot with the organization. That way, I can just create a rule to make one column an SQL query, then index that column to increase search/read speed. The rules that determine where to pull supplier information from are somewhat involved, and need to take country and price into account, as well as perhaps a few other things.

So I guess the question here is, is there a correct way of doing this? Or is it going to be messy no matter how I do it? Also, am I on the right track?
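To illustrate, here's a rough sketch of the sort of view I'm imagining (table and column names here are made up; my real feed tables have more fields than this):

```sql
-- Hypothetical: each supplier feed table exposes at least these columns.
CREATE VIEW all_supplier_products AS
SELECT 'supplier_a' AS supplier, id, sku, price, country FROM supplier_a_feed
UNION ALL
SELECT 'supplier_b' AS supplier, id, sku, price, country FROM supplier_b_feed;
```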

virnovus
  • In the supplier tables, do they all have the same columns? Can you have a column that references the product table? Can you have a column that identifies the supplier? If the answer is yes, the correct way is probably a single supplier table, or using SQL inheritance to combine the several supplier tables; otherwise it's a really ugly cascaded union query. – Jasen Jan 06 '15 at 21:03
  • I would do it the other way round: create one single table with a supplier column containing **all** products. And a *view* for each supplier using triggers to allow inserting into the views. –  Jan 06 '15 at 23:15
  • @Jasen: The supplier inventory tables are each pulled in directly via COPY from CSV files. Most of the fields are the same, but there are some differences. We do occasionally need to query the supplier tables directly, so I don't want to merge them all into one. – virnovus Jan 06 '15 at 23:29
  • @a_horse_with_no_name: That's pretty close to how we're doing things now, however we've been running into a lot of difficulties in merging new data with our historical data. (Also, it's not always clear what's "historical".) We're storing a lot more data than we – virnovus Jan 07 '15 at 15:53

2 Answers


Using a view unifying all your feed tables might well simplify the form of your queries, but you cannot index a view. (Well, in Oracle I think you can index a MATERIALIZED view, but that's a special case).

Structurally, I find it a bit suspect that you split your supplier feeds into separate tables. Doing so may simplify and speed updates from the supplier feeds, and it is certainly the fastest alternative for queries against specific, individual feeds, but it's ugly for updating (recomputing?) the main table, and it gives you no sound way to relate rows of the main table back to the particular supplier feed from which they were drawn.

If you need fast queries against the supplier feeds, independent of the main table, and you also need the main table to be related to a detail table containing supplier-specific information, then perhaps your best bet would be to maintain a physical auxiliary table as the UNION ALL of all the per-supplier tables (this requires those tables to have the same structure), each with a distinct supplier ID. In Oracle, you can automate that as a MATERIALIZED VIEW, but with most DBMSs you would need to maintain that table manually.

The auxiliary table can be indexed, can be joined to the main table as needed in queries, and can be queried fairly efficiently. If appropriate, it can be used to update the main table.
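A rough sketch of what that looks like in PostgreSQL, assuming hypothetical per-supplier feed tables with matching columns:

```sql
-- Unify two per-supplier feed tables, tagging each row with its supplier.
CREATE MATERIALIZED VIEW supplier_products AS
SELECT 1 AS supplier_id, sku, price, country FROM supplier_a_feed
UNION ALL
SELECT 2 AS supplier_id, sku, price, country FROM supplier_b_feed;

-- Because a materialized view is backed by a real table, it can be indexed:
CREATE INDEX ON supplier_products (sku);

-- It does not track changes to the base tables as they happen;
-- refresh it after each feed import:
REFRESH MATERIALIZED VIEW supplier_products;
```

With a DBMS that lacks materialized views, you would create an ordinary table and repopulate it with an `INSERT ... SELECT` of the same `UNION ALL` query after each feed update.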

John Bollinger
  • Postgres has materialized views as well (although they aren't updated automagically) –  Jan 06 '15 at 23:13
  • From my understanding, you can index the function that the view uses, no? In any case, the more I look into it, the more it seems like my best bet might be doing the linking outside of SQL. Probably in Rails. In the past, I've gotten huge performance increases by going around the Rails Activerecord API and setting up a lot of the SQL functionality directly, but perhaps this isn't one of those cases. – virnovus Jan 07 '15 at 01:25
  • 1
    No, you cannot index the defining query of a (normal) view. You can index that query's base tables, but that's quite a different thing, and unlikely to be as helpful. You can index a `MATERIALIZED` view because it has a *bona fide* table behind it; this is mostly equivalent to creating and managing a normal table of the query results. In particular, it does not track changes to the base tables as they happen, though it facilitates updating the backing table. – John Bollinger Jan 07 '15 at 14:30

Hmm, why not just create one product table that contains data from all suppliers? Have a field in that table that identifies which supplier. When you get your input feeds, update this one table rather than having a separate table for each supplier. If you're using COPY to import a CSV file into a db table, fine, but then the imported table is just a temporary work table. Promptly copy the data from there into the "real", unified table. Then the import table can be dropped or truncated, or more likely you keep it around for troubleshooting. But you don't use it within the program.

You should be able to copy from the import table to the unified table with a single insert statement. Even if the tables are large I'd expect that to be fast. It would almost surely be faster overall to do one mass insert for each import than to have a view that does a union on 10 tables and try to work with that. If the unified table has all the data from all suppliers plus a supplier field, then I don't see why you would ever need to query the raw import tables. Except, that is, for troubleshooting problems with the import, but fine, so you keep them around for that. Unless you're constrained on disk space so that keeping what amounts to duplicates of every record is a problem, I'd think this would be the easy solution. If disk space is an issue, then drop the import table immediately after copying the data to the unified table, and keep the original raw import on backup media somewhere.
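In PostgreSQL that flow might look like this (all table, column, and file names here are made up for illustration):

```sql
-- 1. Bulk-load the raw CSV into the import work table.
COPY import_supplier_a (sku, price, country)
FROM '/path/to/supplier_a.csv' WITH (FORMAT csv, HEADER true);

-- 2. One mass insert into the unified table, tagging rows with the supplier.
INSERT INTO products (supplier_id, sku, price, country)
SELECT 1, sku, price, country
FROM import_supplier_a;

-- 3. Empty the work table, or keep its contents for troubleshooting.
TRUNCATE import_supplier_a;
```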

Jay
  • This is what I do right now. It's causing problems because we get new data sets several times a day that need to overwrite the old data sets that are already there. When everything is in the same table, this is somewhat of a nightmare. Also, suppliers organize their data differently, and some include optional data that we want to keep if it's available. Creating the columns anyway, whether or not they exist for all suppliers, ends up resulting in 80-100 columns, and that number will only increase as we add suppliers. It seems my solution will necessarily be messy, but I guess that's ok. – virnovus Jan 08 '15 at 22:28
  • If the new data on a feed replaces old data for that supplier, you could say `delete from products where supplier_id=@supplier; insert into products (supplier_id, product_name, whatever) select @supplier, product_name, whatever from product_import_foo` (where that last is one of the import tables), inside a transaction so no other process will see the deletes until after the inserts have happened. It's certainly possible that real life is more complicated, that you only delete some records but not all, that you have to preserve autoincrement id's, etc, but the principle would be there. – Jay Jan 09 '15 at 14:37
  • RE lots of columns: You'd have the same problem if you use a view. And if there's ever a situation where you want to present the user with a list of products that could come from multiple companies, or process a cross-company list in any way, I don't see how you'll avoid this issue. 100 columns is certainly a beast, but it beats having code that says "if the supplier is X, do this query, else if the supplier is Y, do a different query, etc" every time you reference a list of products. – Jay Jan 09 '15 at 14:38
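The replace-one-supplier pattern described in the comments, written out as a sketch (table and column names hypothetical):

```sql
BEGIN;
-- Drop this supplier's old rows, then reload them from its import table.
-- Other sessions see neither step until COMMIT.
DELETE FROM products WHERE supplier_id = 1;

INSERT INTO products (supplier_id, sku, price, country)
SELECT 1, sku, price, country
FROM import_supplier_a;
COMMIT;
```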