
I am working on building a product catalog based on OpenIcecat (the Icecat Open Catalog), and I am looking for advice from someone who has experience with this, or with a similar service (CNET, maybe).

My question is, what is a good model for populating a product catalog's database?

Here is what I have so far...

  1. I GET the XML feed of the entire catalog
  2. I extract the data about the products that I need based on the Category ID
  3. I insert all of that data into a table, so now I have one table per category (e.g. 'Printer cats') that contains the image URL and the XML ID for each product in the category... easy enough
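Something like this is what I mean by steps 1-3; a rough sketch in PHP, where the table and column names are mine, and the index attribute names (path, Prod_ID, Catid, HighPic) are from memory of Icecat's files.index.xml, so verify them against a real file:

<?php
// Sketch: stream the (huge) catalog index and keep only the wanted categories.
// Assumed table: printer_cats (prod_id VARCHAR PK, cat_id INT,
//                              xml_path VARCHAR, image_url VARCHAR)
$wantedCats = array(234, 235);   // hypothetical printer category IDs
$pdo    = new PDO('mysql:host=localhost;dbname=catalog', 'user', 'pass');
$insert = $pdo->prepare(
    'REPLACE INTO printer_cats (prod_id, cat_id, xml_path, image_url) VALUES (?, ?, ?, ?)');

$reader = new XMLReader();          // stream parser: the index is too big for DOM
$reader->open('files.index.xml');   // the full-catalog index, downloaded beforehand

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'file') {
        $catId = (int) $reader->getAttribute('Catid');
        if (in_array($catId, $wantedCats)) {
            $insert->execute(array(
                $reader->getAttribute('Prod_ID'),
                $catId,
                $reader->getAttribute('path'),     // relative URL of the product XML
                $reader->getAttribute('HighPic'),  // URL of the main product image
            ));
        }
    }
}
$reader->close();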

Here is where I run into a question/concern... It is easy enough to write an ad hoc script that GETs each XML file and image and dumps them into directories, but Icecat does NOT want you to pull very large amounts at once. My categories contain thousands of products (over 40k in one case).

What I feel I need to do is GET the XML for each product, grab the image, and store them both. I feel this way because it's the obvious solution and it's what the client keeps asking for... which doesn't mean it's correct, though. I could then parse each product's XML to extract the description, SKU, etc. into a table so I can build the catalog (for use with Magento, for example), later adding/changing things as needed (prices, related products, etc.). Seems easy enough, but after around 3-4k GET requests I get booted because I'm pulling too much data. Once I have the entire catalog (my catalog of wanted categories), it will be easy enough to grab the update files (XML, and small in comparison) and make changes accordingly. That would be a great point to be at, but first I need to get all the data and build the product table(s).
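To make that concrete, here is the rough shape of the fetch loop I have in mind: throttled, and resumable via a `fetched` flag so it can stop and start without re-grabbing anything. The base URL, credentials, and the safe request rate are all assumptions on my part:

<?php
// Sketch: fetch each product's XML slowly enough to avoid getting booted,
// and record progress so the job can be stopped and resumed.
$pdo  = new PDO('mysql:host=localhost;dbname=catalog', 'user', 'pass');
$rows = $pdo->query('SELECT prod_id, xml_path FROM printer_cats WHERE fetched = 0');
$done = $pdo->prepare('UPDATE printer_cats SET fetched = 1 WHERE prod_id = ?');

foreach ($rows as $row) {
    $ch = curl_init('https://data.icecat.biz/' . $row['xml_path']); // base URL assumed
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERPWD, 'icecat_user:icecat_pass');   // Open Icecat login
    $xml = curl_exec($ch);
    $ok  = (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200);
    curl_close($ch);

    if ($ok) {
        file_put_contents('xml/' . $row['prod_id'] . '.xml', $xml); // 'xml/' must exist
        $done->execute(array($row['prod_id']));  // state tracking: safe stop/restart
    }
    sleep(3);  // ~1,200 requests/hour; tune to whatever rate Icecat tolerates
}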

So here is what I kick around...

One idea is to get the data in real time, as needed, but neither the client nor I want that. The client wants the catalog, understandably... and real-time fetching adds a performance hit and does not plug in (easily) to many solutions. But, expanding on the 'real-time' idea: use real-time GETs of the XML data and store it as it comes in, with logic like 'if it's not present locally, go get it and then store it; if it is present locally, check whether the info is up to date, and update it if not'. Of course, if I'm going to check whether it's up to date on every view, there is little point in storing the data, because I'm making a request every time no matter what... I may as well just fetch it and throw it away, which seems inefficient.
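One thing that would blunt that inefficiency, assuming Icecat's servers honor If-Modified-Since (an assumption worth testing before relying on it): make the up-to-date check a conditional GET. The server answers an empty 304 when nothing changed, so the stored copy still pays for itself:

<?php
// Sketch: re-download a product's XML only if it changed after our local copy.
function refreshProductXml($prodId, $xmlPath) {
    $file = 'xml/' . $prodId . '.xml';
    $ch   = curl_init('https://data.icecat.biz/' . $xmlPath);  // base URL assumed
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if (file_exists($file)) {
        // "Only send the body if it is newer than my copy's timestamp."
        curl_setopt($ch, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE);
        curl_setopt($ch, CURLOPT_TIMEVALUE, filemtime($file));
    }
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($code == 200) {
        file_put_contents($file, $body);  // changed upstream: refresh local copy
    }                                     // 304 means unchanged: keep what we have
    return $file;
}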

-or-

Maybe everything is in real time: products are fetched and displayed in real time, when the admin views products for manipulation they are presented in real time, etc. Always grabbing what's needed, in real time, based on the metadata already in the database that was (is) populated from the 'main' catalog file, which describes the entire catalog available from Icecat. But this doesn't plug into many solutions and takes a performance hit, plus some hosts won't let us GET anyhow... so there are many limitations here. It sounds like an awesome way to be sure you always have super-current info, though that is not needed here.

Here is where I am sort of headed already...

I have the metadata based on the main catalog (over 500K items), and I have already populated tables for the desired categories. Now I am headed towards building an app (tool) that refines what I am working with, down to a single category, and then produces jobs: 'use this category ID and get all the XML feeds', then 'use the same category ID and fetch the images', then 'take the same category ID and build the products' by grabbing the SKU, description, image filename, etc. At that point in the workflow I have all the info and can use the SKU (or whatever is needed) to grab prices etc. from another feed, manipulate descriptions, rename images if needed (SEO), or whatever.
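The 'build the products' job would then be a parse over the saved files, something like this sketch; the element and attribute names (Product, Prod_id, Title, HighPic, ProductDescription/LongDesc) are from memory of the Icecat product XML format, so check them against a real file:

<?php
// Sketch: extract the catalog-building fields from one saved Icecat product XML.
function extractProduct($file) {
    $doc = simplexml_load_file($file);
    if ($doc === false || !isset($doc->Product)) {
        return null;                // malformed/missing file: skip, don't kill the job
    }
    $p = $doc->Product;
    return array(
        'sku'         => (string) $p['Prod_id'],   // manufacturer part number
        'title'       => (string) $p['Title'],
        'image'       => (string) $p['HighPic'],
        'description' => (string) $p->ProductDescription['LongDesc'],
    );
}

// Usage: loop over a category's downloaded files and insert each result
// into the products table (renaming images, massaging descriptions, etc.).
$fields = extractProduct('xml/1198270.xml');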

Then I will need to build a model for updating prices and shipping weights from a different feed (Synnex, in this case), but that seems much easier because shipping and price should be real-time... so it's a different story, and much less data at once; only what's in the cart, I'm thinking.
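The shape I picture for that is: look up only the SKUs actually in the cart, with a short-lived cache so repeated page loads don't hammer the feed. This is entirely hypothetical; fetchSynnexPrice() is a placeholder for however the Synnex feed is really queried:

<?php
// Sketch: real-time price lookups for cart SKUs only, with a 15-minute cache.
// fetchSynnexPrice() is a hypothetical stand-in for the real Synnex feed/API call.
function cartPrices(array $cartSkus, PDO $pdo) {
    $prices = array();
    $cached = $pdo->prepare('SELECT price FROM price_cache
                             WHERE sku = ? AND fetched_at > NOW() - INTERVAL 15 MINUTE');
    $store  = $pdo->prepare('REPLACE INTO price_cache (sku, price, fetched_at)
                             VALUES (?, ?, NOW())');

    foreach ($cartSkus as $sku) {
        $cached->execute(array($sku));
        $hit = $cached->fetchColumn();
        if ($hit !== false) {
            $prices[$sku] = $hit;                     // fresh enough: skip remote call
        } else {
            $prices[$sku] = fetchSynnexPrice($sku);   // hypothetical remote lookup
            $store->execute(array($sku, $prices[$sku]));
        }
    }
    return $prices;
}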

I'm still not sure how to go about this. Supposedly, others have built a catalog like this for the same client by ripping the Icecat repository, but they never made/provided tools for future manipulation, which is where I am headed. Plus, the old catalog is very old/stale, and I have never seen 'proof' that they actually ripped the data and built a catalog; not the full set, anyhow.

OK, to help with the confusion...

The source I am using has a repository of over 600,000 products in many categories. I only need about 45,000 products (across several categories). As it is, downloading the XML files takes hours, at around 1,000 per hour (we know this from past experience), so the initial 45,000 would take roughly 45 hours of fetching.

Part of the problem is that not every XML file is exactly the same, and we need different information from different categories. These requirements are most likely going to change (probably more at first), so I cannot have a single schema to store all of them. Once the 45,000 (or so) files are downloaded, we only have to get the changes/updates in the future. So what I am really trying to do is build a local repository of only the categories we need, so we can work with them more efficiently. We don't plan to use the related categories right away either, so I want the files locally for when we go back to do that, too.
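So the local repository I keep describing could be deliberately dumb: one row per product holding the raw XML, plus only the columns every category shares, so a new extraction requirement never means re-downloading. A sketch (all names here are mine):

<?php
// Sketch: one flexible repository table instead of a normalized schema per category.
$pdo = new PDO('mysql:host=localhost;dbname=catalog', 'user', 'pass');
$pdo->exec('CREATE TABLE IF NOT EXISTS product_repo (
    prod_id    VARCHAR(32) PRIMARY KEY,
    cat_id     INT NOT NULL,
    sku        VARCHAR(64),
    raw_xml    MEDIUMTEXT NOT NULL,   -- the whole file, for whatever we need later
    fetched_at DATETIME NOT NULL,
    INDEX (cat_id)
)');

$save = $pdo->prepare('REPLACE INTO product_repo (prod_id, cat_id, sku, raw_xml, fetched_at)
                       VALUES (?, ?, ?, ?, NOW())');
$save->execute(array('1198270', 234, 'C8017A#ABA',
                     file_get_contents('xml/1198270.xml')));

Category-specific extraction then runs against raw_xml with per-category rules, and can be re-run any time the requirements change.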

rhaag71
  • What you're talking about seems to be product and category feeds, or rather product feeds for your system; am I correct? – Umesh Awasthi Nov 28 '11 at 16:34
  • Sorry, I've been watching for 'answers' but didn't catch the comment... Yes, you are correct, it is ultimately a product and category feed. Where I am now is that I have the categories and I am trying to populate a DB table with products, one category at a time. I am thinking I can reduce the load by processing a single category at a time, plus I am planning on keeping track of products that are already populated... sort of state tracking, so I can stop/start, picking up where it left off, and be more efficient; there's no sense in grabbing data I already have. – rhaag71 Nov 30 '11 at 21:57
  • We use such a thing for our e-commerce application, with the help of a messaging system: the system provides us product feeds and we import them into our DB, and later on we only get update feeds, so once a product is in our system we always run in update mode. The product feed also includes category information, which we use to assign the product to those categories. Currently we use BizTalk and Apache Camel (experimental) for this purpose. Let me know if this helps you. – Umesh Awasthi Dec 01 '11 at 00:59
  • That does help, because it makes me think I am on the right track. I want to have the products, then update them when needed. My challenge right now is getting the 'first time' feeds of the catalog. Currently, I am having trouble deciding how to deal with the data. Not every feed is the same; printers may have data that another product won't. I am thinking about grabbing the data, serializing it, and storing it in the database so that I can work with it later. This does not seem right, but at least I will have a local repository of the data to work with. – rhaag71 Dec 01 '11 at 01:04
  • Honestly, this approach does not seem good to me. Your database should contain only data that is ready to use, rather than raw data. What exactly are you planning to do once the data is in your local repository? – Umesh Awasthi Dec 01 '11 at 01:12
  • I am trying to overcome the initial 'getting' of the data. Not every XML file is the same, so I would have to create many, many normalized tables (one per category) and extract the data differently for each category. If I need something later, I don't want to have to go back and get it remotely, because I'd need the entire file for one piece of information. I may be thinking about this all wrong, but I also want 'stateful' progress in case I have to stop and start again while populating the online (store) catalog. I need to get the description, related items, image (or image URL), etc. – rhaag71 Dec 01 '11 at 01:23
  • Right now all I need is the description, image information, features, related products information, etc. But, like I said, if I need more later then I would have to download all of the files again; I can't just get the little piece I may need to add. I thought using the database might be better than writing the files to the filesystem... that's what the last guy was trying to do, and it doesn't work very well. – rhaag71 Dec 01 '11 at 01:26
  • OK, but your line `create many, many normalized tables` made me more confused :). What platform are you using for your product information management? – Umesh Awasthi Dec 01 '11 at 01:26
  • :-) Sorry, I don't mean to be confusing... this is confusing me too. In the end it's going into Magento. I am looking at PIMCore to help maintain the products, but for now we just need to build the product list that we are getting from Icecat. We will be getting other products from other sources later. I guess what I am working on is part of the PIM system, just for this source. I'm going to edit my question at the bottom to explain :) – rhaag71 Dec 01 '11 at 01:46
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/5484/discussion-between-umesh-awasthi-and-rhaag71) – Umesh Awasthi Dec 01 '11 at 01:52

2 Answers


Did you ever get this resolved? Have you tried installing http://sourceforge.net/projects/icecatpim/ ? It's very difficult to get going, but it allows you to download the product database into a local MySQL database. Once you have the local DB, you could write a cron script to periodically update it as needed, then access/manipulate the data using your own program.
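Roughly, that update script could look like the sketch below: read Icecat's daily update index and flag the changed products so your regular fetch job re-downloads only those. The daily.index.xml name is from memory, and the table/column names are placeholders you'd adapt to your own schema:

<?php
// Sketch (for cron): mark locally-stored products that changed upstream.
$pdo  = new PDO('mysql:host=localhost;dbname=catalog', 'user', 'pass');
$flag = $pdo->prepare('UPDATE products SET stale = 1 WHERE prod_id = ?');

$reader = new XMLReader();
$reader->open('daily.index.xml');   // downloaded beforehand with your Icecat login
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'file') {
        $flag->execute(array($reader->getAttribute('Prod_ID')));
        // products you never stored match no row and are simply ignored
    }
}
$reader->close();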

There is also this project, which could help: http://code.google.com/p/icecat-import/ — it has code to write data to a local DB. It was originally designed to download the full DB, but there is a lot of data missing.

I've posted a patch for an InnoDB version of the database: http://sourceforge.net/tracker/?func=detail&aid=3578405&group_id=312187&atid=1314070 — I hope that helps. There are a lot of bad references in the database that you have to watch out for.

Here is a cheat sheet for checking and setting DB referential integrity:

-- Is the DB holding referential integrity? (rows in Aclass with no matching Bclass row)
SELECT * FROM Aclass a LEFT JOIN Bclass b ON a.column = b.column WHERE b.column IS NULL;

-- Set NULL on update (null out the dangling references)
UPDATE Aclass a LEFT JOIN Bclass b ON a.column = b.column SET a.column = NULL WHERE b.column IS NULL;

-- Cascade on delete (drop rows whose referenced row is gone)
DELETE FROM Aclass WHERE id NOT IN (SELECT DISTINCT id FROM Bclass);
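For instance, if a product table references a category table through a shared category ID column (names for illustration only), the first query lists the orphaned products, and you can then pick between the NULL-out and the delete treatment for cleaning them up.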
deefactorial
  • I know this is over a year after your answer, but... I did mess with the Icecat PIM (and it seems to have gotten better since I first asked this), but (as you stated) it's difficult to work with. At the time it seemed to have some bugs, too. We have since utilized another developer's assistance; he seems to have spent a lot of time building some nice PHP scripts to deal with Icecat. It imports on the fly and then saves what it gets for the future (sort of cache-like). – rhaag71 Feb 20 '14 at 21:23
  • Can you provide us with a link to a repository or page where we can download the app to import the data? – Giovanni Bitliner Apr 14 '14 at 18:58

I have never seen an application as hard to install as IcePIM.
I found this PHP script instead:

http://snipplr.com/view/33098/icecat-product-specifications-drawer/

  • Please don't just put a link; please summarize what you found there. Is that a script to install it? Or what? – Andrew May 07 '14 at 20:41