Handling duplicate data when parsing huge XML feeds

Question

I'm writing a component which parses an xml feed with stock quotes and saves the result in a database. The problem is fairly straightforward, except that feed can not be read incrementallly. That is, there is no way to specify that you only want the X last quote changes or only changes newer than X minutes, say. I know the real problem is that the feed is stupid and that the provider should fix their stuff, but that is not an option atm.

The feed is a huge xml file which contains the 100000 last stock quotes for the provider. The feed is polled once every minute during which there is about 50-100 changed quotes. The rest is duplicate quotes which are read again and again and again.

During each poll of the feed, I parse all quotes (using lxml) to objects. Then, for each quote object, I check if the quote already exist in the database. If it does, I discard it and if it doesn't, I save it. This procedure is extremely wasteful since only about 0.1% is new data, there rest is duplicates. To optimize it a bit, I create a lookup table by querying the database once for quotes updated in the last X hours. The quotes are unique in the database on the (last_update, stock_id) key so this optimization reduces the number of queries by about 50%.

But there is still 50k db queries where each quote have to be checked individually if it exists or not which is very taxing on the database.

So what I'm looking for is ideas on how to make my feed parser faster. Maybe there is a way to diff the last fetched xml file with the new one?

score 1 · Accepted Answer · answered Apr 01 '11 at 08:50

Are the most recent items at the top or the bottom of the feed? If they are at the top then you could stop parsing when you have seen the first item which is already present in the database.

If the most recent items come last you could cache the quote keys and just look them up in memory and start hitting the database once you come to a non cached one. Or you could remember the last quote you put in the database and when parsing all the items you look for it and only hit the database for items after it.

score 1 · Answer 2 · edited May 23 '17 at 12:26

Your problem divides into two areas: 1) how to avoid parsing what you don't need to parse, and 2) how to avoid database operations that you don't need either.

If the quotes themselves are very small, you probably won't gain much from trying to solve (1). Otherwise, you could create a filter (using XSLT or SAX, for example) that would discard the quotes that you don't care about, and then do your full DOM parse on the rest.

To solve (2), diffing XML files is in general can be tricky because changes in whitespace in your XML document, all-too-common from some providers, can cause false positives, and you generally need something that analyzes the actual XML structure, not a simple textual line-by-line diff. If you don't think this will be a problem for you, there are several Stack Overflow topics you can explore, but I think they will also demonstrate that XML diffs are still a bit of a wooly area, particularly in the open source arena:

Another approach that could work would be to use local or distributed memory caching for speedy lookups of stuff that's already updated. You'd get the benefit of avoiding having to try and filter or diff your content, and you may readily be able to adapt your caching infrastructure for other use cases if you're building a long-term infrastructure. OTOH, creating a scalable distributed caching infrastructure is not a particularly cheap solution.

Handling duplicate data when parsing huge XML feeds

2 Answers2