I'm writing a component which parses an xml feed with stock quotes and saves the result in a database. The problem is fairly straightforward, except that feed can not be read incrementallly. That is, there is no way to specify that you only want the X last quote changes or only changes newer than X minutes, say. I know the real problem is that the feed is stupid and that the provider should fix their stuff, but that is not an option atm.
The feed is a huge xml file which contains the 100000 last stock quotes for the provider. The feed is polled once every minute during which there is about 50-100 changed quotes. The rest is duplicate quotes which are read again and again and again.
During each poll of the feed, I parse all quotes (using lxml) to objects. Then, for each quote object, I check if the quote already exist in the database. If it does, I discard it and if it doesn't, I save it. This procedure is extremely wasteful since only about 0.1% is new data, there rest is duplicates. To optimize it a bit, I create a lookup table by querying the database once for quotes updated in the last X hours. The quotes are unique in the database on the (last_update, stock_id) key so this optimization reduces the number of queries by about 50%.
But there is still 50k db queries where each quote have to be checked individually if it exists or not which is very taxing on the database.
So what I'm looking for is ideas on how to make my feed parser faster. Maybe there is a way to diff the last fetched xml file with the new one?