5

I'm currently trying to process a number of data feeds that I have no control over, where I am using Regular Expressions in C# to extract information.

The originator of the data feed is extracting basic row data from their database (like a product name, price, etc), and then formatting that data within rows of English text. For each row, some of the text is repeated static text and some is the dynamically generated text from the database.

e.g

Panasonic TV with FREE Blu-Ray Player

Sony TV with FREE DVD Player + Box Office DVD

Kenwood Hi-Fi Unit with $20 Amazon MP3 Voucher

So the format in this instance is: PRODUCT with FREEGIFT.

PRODUCT and FREEGIFT are dynamic parts of each row, and the "with" text is static. Each feed has about 2000 rows.

Creating a Regular Expression to extract the dynamic parts is trivial.

The problem is that the marketing bods in control of the data feed keep on changing the structure of the static text, usually once a fortnight, so this week I might have:

Brand new Panasonic TV and a FREE Blu-Ray Player if you order today

Brand new Sony TV and a FREE DVD Player + Box Office DVD if you order today

Brand new Kenwood Hi-Fi unit and a $20 Amazon MP3 Voucher if you order today

And next week it will probably be something different, so I have to keep modifying my Regular Expressions...

How would you handle this?

Is there an algorithm to determine static and variable text within repeating rows of strings? If so, what would be the best way to use the output of such an algorithm to programatically create a dynamic Regular Expression?

Thanks for any help or advice.

Community
  • 1
  • 1
waveydavey
  • 51
  • 2
  • 1
    An algorithm would get you a best guess; marketing guys could tell you for sure. Do you have a direct line to them? Would it be possible to give *them* an ability to change *your* static template? – Sergey Kalinichenko Jan 26 '12 at 16:05
  • No. I am but a humble affiliate marketer :-) The originator of the feed is a corporation who won't help with that level of customization. All the processing needs to be on my end. Thanks for the comment. – waveydavey Jan 26 '12 at 16:08
  • 1
    Do you know what the PRODUCT and FREEGIFT are for at least one row? Can you always grab that row and then extract the text surrounding the PRODUCT and FREEGIFT for later filtering? – Jeremy Stein May 22 '12 at 20:30
  • You could probably do something using a [longest common substring](http://en.wikipedia.org/wiki/Longest_common_substring_problem) algorithm, but even if you do manage to identify the dynamic/static components of the text, how would you know which one is the FREEGIFT, and which is the PRODUCT? – Simon MᶜKenzie Aug 23 '12 at 04:51

2 Answers2

3

This code isn't perfect, it certainly isn't efficient, and it's very likely to be too late to help you, but it does work. If given a set of strings, it will return the common content above a certain length.

However, as others have mentioned, an algorithm can only give you an approximation, as you could hit a bad batch where all products have the same initial word, and then the code would accidentally identify that content as static. It may also produce mismatches when dynamic content shares values with static content, but as the size of samples you feed into it grows, the chance of error will shrink.

I'd recommend running this on a subset of your data (20000 rows would be a bad idea!) with some sort of extra sanity checking (max # of static elements etc)

Final caveat: it may do a perfect job, but even if it does, how do you know which item is the PRODUCT and which one is the FREEGIFT?

The algorithm

  1. If all strings in the set start with the same character, add that character to the "current match" set, then remove the leading character from all strings
  2. If not, remove the first character from all strings whose first x (minimum match length) characters aren't contained in all the other strings
  3. As soon as a mismatch is reached (case 2), yield the current match if it meets the length requirement
  4. Continue until all strings are exhausted

The implementation

private static IEnumerable<string> FindCommonContent(string[] strings, int minimumMatchLength)
{
    string sharedContent = "";

    while (strings.All(x => x.Length > 0))
    {
        var item1FirstCharacter = strings[0][0];

        if (strings.All(x => x[0] == item1FirstCharacter))
        {
            sharedContent += item1FirstCharacter;

            for (int index = 0; index < strings.Length; index++)
                strings[index] = strings[index].Substring(1);

            continue;
        }

        if (sharedContent.Length >= minimumMatchLength)
            yield return sharedContent;

        sharedContent = "";

        // If the first minMatch characters of a string aren't in all the other strings, consume the first character of that string
        for (int index = 0; index < strings.Length; index++)
        {
            string testBlock = strings[index].Substring(0, Math.Min(minimumMatchLength, strings[index].Length));

            if (!strings.All(x => x.Contains(testBlock)))
                strings[index] = strings[index].Substring(1);
        }
    }

    if (sharedContent.Length >= minimumMatchLength)
        yield return sharedContent;
}

Output

Set 1 (from your example):

FindCommonContent(strings, 4);
=> "with "

Set 2 (from your example):

FindCommonContent(strings, 4);
=> "Brand new ", "and a ", "if you order today"

Building the regex

This should be as simple as:

 "{.*}" + string.Join("{.*}", FindCommonContent(strings, 4)) + "{.*}";
=> "^{.*}Brand new {.*}and a {.*}if you order today{.*}$"

Although you could modify the algorithm to return information about where the matches are (between or outside the static content), this will be fine, as you know some will match zero-length strings anyway.

Simon MᶜKenzie
  • 8,344
  • 13
  • 50
  • 77
0

I think it would be possible with an algorithm , but the time it would take you to code it versus simply doing the Regular Expression might not be worth it.

You could however make your changing process faster. If instead of having your Regex String inside your application, you'd put it in a text file somewhere, you wouldn't have to recompile and redeploy everything every time there's a change, you could simply edit the text file.

Depending on your project size and implementation, this could save you a generous amount of time.

Alexandre
  • 4,382
  • 2
  • 23
  • 33