My question is a continuation of this one. Basically, I have a table of words like so:

HAT18178_890909.098070313.1
HAT18178_890909.098070313.2
HAT18178_890909.143412462.1
HAT18178_890909.143412462.2

For my purposes, I do not need the terminal .1 or .2 for this set of names. I can manually write the following regex (using Python syntax):

import re
r = re.compile(r'(.*\.\d+)\.\d+')

However, I cannot guarantee that my next set of names will have a similar structure where the final 2 characters are discardable: the suffix could be 3 characters (e.g. .12), and the separator could change as well (e.g. from . to _).

What is the appropriate way to either explicitly learn a regex or to determine which characters are unnecessary?

learner
  • what do you mean, "learn a regex"? Learn regex syntax, or "what does this particular usage of a regex mean"? Regex is a language, and you learn the rules. But knowing the rules doesn't mean you'll suddenly be able to slam out the regex equivalent of "shall I compare thee to a summer's day?" – Marc B Dec 08 '14 at 17:09
  • How do you know that you don't need the `.1`? Whatever criteria you used to figure that out will be relevant in your learning algorithm. – Kevin Dec 08 '14 at 17:09
  • @MarcB: From the previous question, it *looks* like OP wants some kind of machine learning. – Kevin Dec 08 '14 at 17:10
  • @Kevin yes, that's right. The point is to learn the common pattern in the strings that I have encapsulated in the (Python) regex. – learner Dec 08 '14 at 17:15
  • Split on punctuation. –  Dec 08 '14 at 17:16
  • @MarcB - the question isn't 'wutz a regex' or 'n e have regex plz'. I specifically want to know if someone has worked an algorithm for taking in a set of strings and learning the different levels of patterns common to them. I can code my perception of the patterns just fine; I am wondering if a computer can learn what I see. – learner Dec 08 '14 at 17:25
  • sure, it's possible. easy? probably not. "neural nets" and whatever other AI-related buzzwords. you can whip up an algorithm to figure out the differences between a set of strings and probably have it build a regex for you. – Marc B Dec 08 '14 at 17:29

2 Answers

1

It's an interesting problem.

X                                  y
HAT18178_890909.098070313.1        HAT18178_890909.098070313
HAT18178_890909.098070313.2        HAT18178_890909.098070313
HAT18178_890909.143412462.1        HAT18178_890909.143412462 
HAT18178_890909.143412462.2        HAT18178_890909.143412462

The problem is that there is not a single solution but many.

Even for a human, it is not clear which regex you actually want.

Based on this data, I would think the possibilities to learn are:

Just match a fixed width of 25: .{25}

Fixed first part: HAT18178_890909.

Then:

Only 2 distinct digits appear at each position (because you show 2 cases), so one character class per position would also be a valid solution: [01] (either 0 or 1) for the first position, [94] for the next, and so on.

The obvious one would be \d+

But it could also be \d{9}

You see, there are multiple correct answers.

These regexes would still work if the second dot were an underscore instead.
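To make the ambiguity concrete, here is a minimal sketch that checks every candidate listed above against the sample names. The helper names (`per_position_classes`, `extract`) and the `candidates` dictionary are my own for illustration, and each pattern is anchored at the start of the name with `re.match`, which is an assumption about how it would be applied.

```python
import re

names = [
    "HAT18178_890909.098070313.1",
    "HAT18178_890909.098070313.2",
    "HAT18178_890909.143412462.1",
    "HAT18178_890909.143412462.2",
]
targets = [
    "HAT18178_890909.098070313",
    "HAT18178_890909.098070313",
    "HAT18178_890909.143412462",
    "HAT18178_890909.143412462",
]


def per_position_classes(strings):
    """Build one literal or character class per position (equal-length strings assumed)."""
    parts = []
    for chars in zip(*strings):
        distinct = sorted(set(chars))
        if len(distinct) == 1:
            parts.append(re.escape(distinct[0]))
        else:
            parts.append("[" + "".join(re.escape(c) for c in distinct) + "]")
    return "".join(parts)


prefix = re.escape("HAT18178_890909.")
candidates = {
    "fixed width": r".{25}",
    "prefix + digits": prefix + r"\d+",
    "prefix + 9 digits": prefix + r"\d{9}",
    "per-position classes": per_position_classes(set(targets)),
}


def extract(pattern, name):
    """Return the part of name matched from the start, or None if there is no match."""
    m = re.match(pattern, name)
    return m.group(0) if m else None


for label, pattern in candidates.items():
    ok = [extract(pattern, n) for n in names] == targets
    print(label, pattern, "OK" if ok else "FAILED")
```

All four candidates reproduce the y column on this data, and they keep doing so when the suffix is attached with an underscore, because none of them looks at the suffix at all. That is exactly why the examples alone cannot tell you which one to prefer.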

My conclusion:

The problem is that it is much more work to prepare the data for machine learning than it is to create a regex. If you want to be sure you cover everything, you need to have complete data, so then a regex is probably less effort.

PascalVKooten
-1

You could split on non-alphanumeric characters (note that this pattern also keeps apostrophes as part of a field):

[^a-zA-Z0-9']+

In this case, that would give you a few strings like this:

HAT18178
890909
098070313
1

From there, you can simply discard the last one if it is never needed, and continue processing the remaining fields.
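A minimal sketch of this approach, assuming the last field is always the disposable one (which is the part the question cannot guarantee). The function names `split_fields` and `drop_last_field` are hypothetical. Instead of rejoining the pieces, the second function cuts the name at the last separator so the original separators in the kept part are preserved.

```python
import re

# Same pattern as above: a run of characters that are not letters, digits, or apostrophes.
SEPARATOR = re.compile(r"[^a-zA-Z0-9']+")


def split_fields(name):
    """Split a name into its fields, dropping the separators."""
    return SEPARATOR.split(name)


def drop_last_field(name):
    """Cut the name at its last separator; return it unchanged if it has none."""
    matches = list(SEPARATOR.finditer(name))
    if not matches:
        return name
    return name[:matches[-1].start()]


print(split_fields("HAT18178_890909.098070313.1"))
print(drop_last_field("HAT18178_890909.098070313.1"))
print(drop_last_field("HAT18178_890909.098070313_12"))
```

Because the cut point is found from the separator rather than from a fixed width, this handles both a longer suffix (.12) and a different separator (_) without changing the pattern.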

Ieuan