
I have about 50 XML feeds I need to parse and sort. I have done that with Nokogiri: it parses the XML feeds on page load and creates a hash that I iterate through. But it is really slow, so I am looking for a better solution.

Solutions I have thought of:

  1. Create a cron job that creates a static XML feed with all 50 feeds parsed and sorted (roughly what I sketch after this list), then parse this combined XML feed with JS or Nokogiri. Which is faster: parsing it client-side or server-side?

  2. Somehow break the cron job's XML feed up into parts for pagination. The feed has, for example, 200-500 items and I only need to show the user about 8 items per page.
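
Roughly what I have in mind for option 1, as a sketch (the task name, output path, and feed URLs are placeholders; I'd trigger it from plain cron or the whenever gem):

```ruby
# lib/tasks/feeds.rake -- placeholder names; scheduled via cron/whenever
require 'open-uri'
require 'nokogiri'
require 'time'

FEED_URLS = %w[
  http://blog-one.example/feed
  http://blog-two.example/feed
] # ...about 50 feeds in total

namespace :feeds do
  desc 'Fetch all source feeds, sort their items, write one combined XML file'
  task :combine do
    # Collect every <item> from every feed
    items = FEED_URLS.flat_map do |url|
      Nokogiri::XML(open(url)).xpath('//item').to_a   # URI.open on Ruby 3+
    end

    # Newest first, by pubDate
    items = items.sort_by { |i| Time.parse(i.at('pubDate').text) }.reverse

    # Wrap the sorted items in a single RSS envelope
    combined = %(<?xml version="1.0"?><rss version="2.0"><channel>) +
               items.map(&:to_xml).join +
               %(</channel></rss>)
    File.write('public/combined.xml', combined)
  end
end
```

Then the page would only ever have to read public/combined.xml instead of hitting 50 remote feeds.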

  • I'd guess that it's not the parsing that's slow, but fetching the feeds for parsing. Have you profiled your app? What's the actual bottleneck? Also, "slow" and "faster" mean very different things to different people. – Sergio Tulentsev Jan 24 '13 at 11:08
  • I have now tested with about 20 (external) XML feeds and the load time for the page was 14-30 s. – Rails beginner Jan 24 '13 at 11:14
  • This question is related to http://stackoverflow.com/q/14459907/128421. – the Tin Man Jan 27 '13 at 06:46
  • Unless you own those feeds, don't hit them every two hours unless that is the refresh rate for the feed, set by the author. Some feeds only change weekly or monthly, so hitting them every two hours wastes CPU and bandwidth needlessly. See "[RSS: refresh rate?](http://stackoverflow.com/questions/6406928/rss-refresh-rate)" and "[RSS feed: how to recommend an update interval?](http://stackoverflow.com/questions/6389255/rss-feed-how-to-recommend-an-update-interval/6394390#6394390)". – the Tin Man Jan 27 '13 at 06:53

1 Answer


> it parses the XML feeds on page load

Really bad idea. Unless you need super-fresh information and are willing to sacrifice some machine resources for it.

Fetch/parse them in a background process. Store results in a db (or file, whatever works) and serve your local content. This will be much faster.

Parse them in the background even if they change very frequently. This way you don't burn CPU and load the network by having several web workers do exactly the same work.
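
A minimal sketch of that background job, assuming a hypothetical `FeedEntry` ActiveRecord model with `guid`, `title`, `url`, `body`, `published_at` and `source` columns (none of these names come from the question; they are just illustrative):

```ruby
require 'open-uri'
require 'nokogiri'
require 'time'

# Hypothetical background job: fetch each feed, upsert its items into
# the feed_entries table, and let web requests read only local rows.
class FeedRefresher
  FEED_URLS = %w[
    http://blog-one.example/feed
    http://blog-two.example/feed
  ] # ...the ~50 WordPress feeds

  def self.run
    FEED_URLS.each do |url|
      doc = Nokogiri::XML(open(url))   # URI.open on Ruby 3+
      doc.xpath('//item').each do |item|
        entry = FeedEntry.where(guid: item.at('guid').text).first_or_initialize
        entry.assign_attributes(
          title:        item.at('title').text,
          url:          item.at('link').text,
          body:         item.at('description').text,
          published_at: Time.parse(item.at('pubDate').text),
          source:       url
        )
        entry.save! if entry.changed?   # insert new posts, update changed ones
      end
    end
  end
end
```

Schedule it from cron or the whenever gem, e.g. `rails runner 'FeedRefresher.run'`, respecting each feed's advertised refresh rate as pointed out in the comments.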

Sergio Tulentsev
  • If I create a background job that creates one XML file that may have 200-500 items, and I want to paginate only 8 items per page, how should I break up the file so it loads faster? – Rails beginner Jan 24 '13 at 11:37
  • Is it faster to break up the file or just load the whole file? – Rails beginner Jan 24 '13 at 11:39
  • After you've parsed the file, you can store it in a db (as separate items). Pagination is a solved problem there (see the sketch after these comments). – Sergio Tulentsev Jan 24 '13 at 11:39
  • So I should store 200-500 items in the DB and run a background job every 2 hours? – Rails beginner Jan 24 '13 at 11:41
  • What kind of XML is it? Like RSS feed? – Sergio Tulentsev Jan 24 '13 at 11:42
  • They are WordPress RSS feeds. – Rails beginner Jan 24 '13 at 11:42
  • So, blog feed. It has known structure (title, body, etc). You can create a db table and save all those pieces of information in corresponding columns. That's how I would do it. – Sergio Tulentsev Jan 24 '13 at 11:45
  • But then I would need to delete the entire table in the DB and create 500 new items every 2 hours. – Rails beginner Jan 24 '13 at 11:46
  • I just created one XML feed from all the other 50 feeds and parsed that feed with Nokogiri. Very fast indeed. Load time: 0.156009 – Rails beginner Jan 24 '13 at 12:39
  • That proves what Sergio said. Parsing is fast. Fetching is slow. I would have a background process scan the blogs and insert *new posts* into a DB (I wouldn't blow away the entire thing and start from scratch each time). – Mark Thomas Jan 24 '13 at 12:50
  • @Railsbeginner, "But then I would need to delete the entire table in the DB and create 500 new items every 2 hours". No, you only delete expired/changed entries. Check whether they have expired by looking at what the feed's refresh information says to use, grab the feed, look for the article in your table, and see if it's the same. If it is, you're done. If it isn't, update it. If there are articles in the DB for that URL that weren't in the current feed, delete those old (expired) entries. – the Tin Man Jan 28 '13 at 18:43
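
Pulling the comment thread together, here is a rough sketch of the two pieces discussed above, reusing the hypothetical `FeedEntry` model from the answer's sketch: serving 8 stored items per page, and pruning only the rows that have dropped out of a source feed instead of wiping the table every two hours:

```ruby
# Pagination: each request reads one page of 8 rows from the DB instead of
# re-parsing any XML. Gems such as will_paginate or kaminari wrap this same
# OFFSET/LIMIT pattern.
class PostsController < ApplicationController
  PER_PAGE = 8

  def index
    page     = [params[:page].to_i, 1].max
    @entries = FeedEntry.order('published_at DESC')
                        .offset((page - 1) * PER_PAGE)
                        .limit(PER_PAGE)
  end
end

# Incremental cleanup, per the last comment: after refreshing one source
# feed, delete only the stored rows whose guid no longer appears in that
# feed, rather than truncating the whole table.
class FeedPruner
  def self.prune(source_url, current_guids)
    FeedEntry.where(source: source_url)
             .where('guid NOT IN (?)', current_guids)
             .delete_all
  end
end
```

With the rows already local and indexed, each page render is a single cheap query instead of dozens of HTTP fetches.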