I'm currently trying to process a number of data feeds that I have no control over, where I am using Regular Expressions in C# to extract information.
The originator of the data feed is extracting basic row data from their database (like a product name, price, etc), and then formatting that data within rows of English text. For each row, some of the text is repeated static text and some is the dynamically generated text from the database.
e.g
Panasonic TV with FREE Blu-Ray Player
Sony TV with FREE DVD Player + Box Office DVD
Kenwood Hi-Fi Unit with $20 Amazon MP3 Voucher
So the format in this instance is: PRODUCT with FREEGIFT.
PRODUCT and FREEGIFT are dynamic parts of each row, and the "with" text is static. Each feed has about 2000 rows.
Creating a Regular Expression to extract the dynamic parts is trivial.
The problem is that the marketing bods in control of the data feed keep on changing the structure of the static text, usually once a fortnight, so this week I might have:
Brand new Panasonic TV and a FREE Blu-Ray Player if you order today
Brand new Sony TV and a FREE DVD Player + Box Office DVD if you order today
Brand new Kenwood Hi-Fi unit and a $20 Amazon MP3 Voucher if you order today
And next week it will probably be something different, so I have to keep modifying my Regular Expressions...
How would you handle this?
Is there an algorithm to determine static and variable text within repeating rows of strings? If so, what would be the best way to use the output of such an algorithm to programatically create a dynamic Regular Expression?
Thanks for any help or advice.