1

A regex answers a yes or no question - does the string match the pattern.

I want to separate the no's into two categories:

  • invalid strings that are prefixes of valid strings
  • invalid strings that are not prefixes of valid

Here's an example (regular expression 01+2):

  • 012 is valid

  • 12 is invalid; it is not a prefix of a valid string

  • 01 is invalid; it is a prefix of a valid string: 012

Java can do this.

Can re do this? If not, is there a library that can make this distinction?

Jesus is Lord
  • 14,971
  • 11
  • 66
  • 97
  • 1
    Define “perfectly matched”. The pattern `foo` matches the string `The food is in the barn.`, for example, and it does so perfectly well. On the other hand, the pattern `these` fails to match that same string. Are you wanting something that stops partway through the pattern and tells you where it failed? – tchrist Apr 05 '12 at 17:03
  • There I made it more precise. I want to determine whether or not a failed regex match, as I defined in my question, is a prefix of some string, or not, in the language generated by the regular expression. – Jesus is Lord Apr 05 '12 at 17:11
  • Don’t use `match`. End of story. These things become trivial when you use the proper interface, and ridiculous if you don’t. And your idea of precision does not seem to correspond with my own. – tchrist Apr 05 '12 at 17:14
  • I made it even more precise. I believe if I made it more precise than it is already, I think I would make the question inaccessible to those without a background in automata, but who might thoroughly understand the concept of regular expressions and variations thereof in the context of practical programming languages. – Jesus is Lord Apr 05 '12 at 17:33
  • `s/is in the language/matches/g; s/is not in the language/doesn’t match/g` – tchrist Apr 05 '12 at 17:54
  • I am not sure what you mean by that. – Jesus is Lord Apr 05 '12 at 18:05
  • Alas, the lack of a proper Unix background rears its ugly head. That’s an editor command. – tchrist Apr 05 '12 at 18:15
  • If you want only certain pieces to match for sure, and others only optionally, then you need to construct your pattern accordingly as I have done in my answer. – tchrist Apr 05 '12 at 18:21

3 Answers3

5

I second the recommendation of regex. The module is simply fantastic.

Here's an example of fuzzy matching with regex:

import regex

# traditional matching - three digits
r = '(?:\d\d\d)'
print regex.findall(r, '1xx22yy333zz')
## ['333']

# fuzzy matching - three digits, allow at most 2 deletions
r = '(?:\d\d\d){d<3}'
print regex.findall(r, '1xx22yy333zz')
## ['1', '22', '333']

The {d<3} part basically says "if we add one or two chars to that, that would be a match" - the same as point 3. in your question.

See http://pypi.python.org/pypi/regex for more info (look for 'Approximate "fuzzy" matching').

georg
  • 211,518
  • 52
  • 313
  • 390
  • 1
    This does what I want. However it isn't part of what's included in a standard Python distribution, I believe. If someone can give me an answer that only involves standard libraries, I will award that as the accepted answer. If not, then yours takes it :) – Jesus is Lord Apr 05 '12 at 17:55
2

I can’t tell what you really want. Are you trying to understand incremental matches?

Perhaps you need to learn to use anchors properly. Just as you should never use the Java misnamed and deceptive matches method instead of its find method, you should probably eschew Python’s match method in favor of search, and for precisely the same reason.

Another possibility is to rewrite your pattern with optional portions that you can then inspect the success of.

Or perhaps you should look into the fuzzy matching support in Matthew Barnett’s replacement regex library, which you really should be using instead of the crufty old re.

I can’t tell what you’re realy asking, because you haven’t given examples of desired input and output.

EDIT

Perhaps you need nothing more complex than (?=.*(?:ab|bc)).*a?b?c?, or spaced out:

 (?x)
 (?= .* (?: ab | bc) )
 .*
 a? b? c?

If you put a, b, and c into recursible subgroups, you wouldn’t even have to repeat yourself.

tchrist
  • 78,834
  • 30
  • 123
  • 180
  • @thg435 No, it’s not alright, because that is **not** fuzzy matching. Fuzzy matching involves the Levenshtein distance. Please remove the inapplicable edit. You may supply your own answer if you wish. – tchrist Apr 05 '12 at 17:18
  • @tchrist: According to Wikipedia: "The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character." I believe the syntax `{d<3}` means allow up to 2 deletions. Therefore the answer was an example of fuzzy matching, limited to 0, 1 or 2 deletions. Did you mean that since it forbade the other operations it was not an instance of pure fuzzy matching? – Jesus is Lord Apr 05 '12 at 17:54
  • @WordsLikeJared Yes, the `{d<3}` part corresponds to fuzzy matching; I hadn’t seen it previously. However, you really need to disabuse yourself of all this ʀᴇɢᴜʟᴀʀ ʟᴀɴɢᴜᴀɢᴇ jazz, because that has nothing to do with almost any current regex syntax or system, including both Java’s and Python’s. You need to learn how to get the existing system to do whatever it is you really want to do. Modern pattern matching systems are infinitely more powerful than all this ʀᴇɢᴜʟᴀʀ ʟᴀɴɢᴜᴀɢᴇ business; just learn how to use them. – tchrist Apr 05 '12 at 17:56
  • At my university they do not teach pattern matching in the programming courses. Most of my senior-level peers do not know how to use Java's Pattern class, for example, even though Java is the primary instructional language at my university. My university teaches automata, though, which is how I conceptualize things and why I used all that jargon. – Jesus is Lord Apr 05 '12 at 18:08
2

Correct me if I am wrong, I believe the original question was: for some regex and input string if the input string is a possible match (ie. adding more characters to the end of the string could make it a match).

Thus fuzzy matching isn't an answer because:

rgx = regex.compile('(abcdef){d<5}')

str = 'bcd'

will be a match as well

Boost C++ library (Also for Java/Javascript?) provides such an option: http://www.boost.org/doc/libs/1_31_0/libs/regex/doc/partial_matches.html

So does anyone know what is the best way to achieve this in Python?

user250765
  • 91
  • 1
  • 1
  • 4