26

I am writing a program that parses text with regular expressions. The regular expression should be obtained from the user. I decided to use glob syntax for user input and to convert the glob string to a regular expression internally. For example:

"foo.? bar*" 

should be converted to

"^.*foo\.\w\bar\w+.*"

Somehow, I need to escape all meaningful characters in the string, and then replace the glob * and ? characters with the appropriate regexp syntax. What is the most convenient way to do this?

PEZ
Evgeny Lazin
  • 1
    The regex looks a bit strange. Like: "^.*foo" could be written as "foo". And I think the globbing star translates to the regex ".*?". Where did the space in the search go? And \bar matches words starting with "ar". – PEZ Jan 15 '09 at 09:17

6 Answers

65

There is no need for incomplete or unreliable hacks. Python includes a function for this:

>>> import fnmatch
>>> fnmatch.translate( '*.foo' )
'.*\\.foo$'
>>> fnmatch.translate( '[a-z]*.txt' )
'[a-z].*\\.txt$'
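A minimal sketch of how the translated pattern might be used for matching. The helper name `glob_matches` is hypothetical, not part of `fnmatch`. The exact string returned by `fnmatch.translate` differs between Python versions, so the sketch compiles the result rather than relying on its text:

```python
import fnmatch
import re


def glob_matches(glob, name):
    """Hypothetical helper: True if `name` matches `glob` as a whole."""
    # fnmatch.translate returns a regex string anchored at the end;
    # re.match anchors it at the start.
    regex = re.compile(fnmatch.translate(glob))
    return regex.match(name) is not None


print(glob_matches("*.foo", "bar.foo"))          # True
print(glob_matches("*.foo", "bar.food"))         # False
print(glob_matches("[a-z]*.txt", "notes.txt"))   # True
```

Because the pattern is anchored, this checks whole strings. It does not search inside a longer text the way the question asks.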
  • If you are using python to create the regex, then you should use python to compare using the regex because sed does not understand the trailing '\\Z(?ms)' which is actually output by fnmatch.translate. – Paul Dec 10 '20 at 04:24
  • 3
    It's such a wonderful feeling when you need a function, google for a snippet, and discover that very function is in a built-in library. This is why I love python. – Matthew Leingang Apr 01 '21 at 19:27
  • Unfortunately, `translate` only helps for single filename matches. It's no help for paths, because `re.match(fnmatch.translate('*.txt'), 'a/foo.txt')` will match. Users expect `glob('*.txt')` not to match subdir paths. See https://github.com/jaraco/zipp/issues/98 where this limitation manifests. – Jason R. Coombs Jul 13 '23 at 02:03
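To illustrate the path limitation from the last comment, here is a rough sketch that matches a pattern one path component at a time. The function `path_glob_match` is a hypothetical helper, it assumes `/` as the separator, and it does not handle `**`:

```python
import fnmatch
import re


def path_glob_match(pattern, path):
    """Hypothetical helper: match each '/'-separated component separately,
    so that '*' never crosses a directory boundary."""
    pattern_parts = pattern.split("/")
    path_parts = path.split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(
        fnmatch.fnmatchcase(part, pat)
        for part, pat in zip(path_parts, pattern_parts)
    )


# The translated regex lets '*' swallow the slash:
print(re.match(fnmatch.translate("*.txt"), "a/foo.txt") is not None)  # True

# The per-component check does not:
print(path_glob_match("*.txt", "a/foo.txt"))    # False
print(path_glob_match("*/*.txt", "a/foo.txt"))  # True
```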
3

I'm not sure I fully understand the requirements. If I assume the users want to find text "entries" where their search matches, then I think this brute-force approach would work as a start.

First escape everything regex-meaningful. Then use non-regex replaces for replacing the (now escaped) glob characters and build the regular expression. Like so in Python:

regexp = re.escape(search_string).replace(r'\?', '.').replace(r'\*', '.*?')

For the search string in the question, this builds a regexp that looks like so (raw):

foo\..\ bar.*?

Used in a Python snippet:

import re

search = "foo.? bar*"
text1 = 'foo bar'
text2 = 'gazonk foo.c bar.m m.bar'

# Escape everything, then turn the escaped glob characters into regex wildcards.
searcher = re.compile(re.escape(search).replace(r'\?', '.').replace(r'\*', '.*?'))

for text in (text1, text2):
  if searcher.search(text):
    print('Match: "%s"' % text)

Produces:

Match: "gazonk foo.c bar.m m.bar"

Note that if you examine the match object, you can find out more about the match and use it for highlighting or whatever.
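As a rough illustration of the highlighting idea, the match object's `start()` and `end()` give the span of the hit. The helper `highlight` and its bracket markers are hypothetical, chosen only for this sketch:

```python
import re


def highlight(glob, text, left="[", right="]"):
    """Hypothetical helper: bracket the first match of `glob` in `text`."""
    regex = re.escape(glob).replace(r"\?", ".").replace(r"\*", ".*?")
    match = re.search(regex, text)
    if match is None:
        return text
    return text[:match.start()] + left + match.group(0) + right + text[match.end():]


print(highlight("foo.? bar*", "gazonk foo.c bar.m m.bar"))
# gazonk [foo.c bar].m m.bar
```

Note that the trailing lazy `.*?` matches the empty string here, so the highlighted span stops right after `bar`.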

Of course, there might be more to it, but it should be a start.

PEZ
  • That's right, but you also need to replace ()|\ [] and other meaningful characters in the search string – Evgeny Lazin Jan 15 '09 at 10:59
  • Additionally, you can't blindly replace `?` with `.` because the `?` might be in a character set. Consider `a[?]txt`: with the suggested searcher, it would match `a.txt` even though it should only match `a?txt`. – Jason R. Coombs Jul 13 '23 at 02:08
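One way to address the character-set problem from the last comment is to walk the glob character by character instead of doing a blind replace. The following is only a sketch under stated assumptions: `glob_to_regex` is a hypothetical name, `[!...]` is treated as negation, and an unclosed `[` is taken literally.

```python
import re


def glob_to_regex(glob):
    """Hypothetical helper: translate a glob into an unanchored regex string."""
    out = []
    i, n = 0, len(glob)
    while i < n:
        ch = glob[i]
        if ch == "*":
            out.append(".*?")
        elif ch == "?":
            out.append(".")
        elif ch == "[":
            # Find the closing bracket; a leading '!' or ']' belongs to the set.
            j = i + 1
            if j < n and glob[j] == "!":
                j += 1
            if j < n and glob[j] == "]":
                j += 1
            end = glob.find("]", j)
            if end == -1:
                out.append(r"\[")  # unclosed '[' is a literal bracket
            else:
                body = glob[i + 1:end].replace("\\", "\\\\")
                if body[0] == "!":
                    body = "^" + body[1:]
                elif body[0] == "^":
                    body = "\\" + body
                out.append("[" + body + "]")
                i = end
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


print(re.fullmatch(glob_to_regex("a[?]txt"), "a?txt") is not None)  # True
print(re.fullmatch(glob_to_regex("a[?]txt"), "a.txt") is not None)  # False
```

For globs without brackets, this builds the same expression as the escape-and-replace one-liner.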
1

Jakarta ORO has an implementation in Java.

orip
1

I wrote my own function, using C++ and boost::regex:

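#include <iterator>
#include <sstream>
#include <string>
#include <boost/algorithm/string/trim.hpp>
#include <boost/regex.hpp>

std::string glob_to_regex(std::string val)
{
    boost::trim(val);
    // Groups: 1 = '*', 2 = '?', 3 = blank, 4 = regex metacharacter to escape.
    const char* expression = "(\\*)|(\\?)|([[:blank:]])|(\\.|\\+|\\^|\\$|\\[|\\]|\\(|\\)|\\{|\\}|\\\\)";
    // '*' -> \w+, '?' -> ., blank -> \s*, metacharacter -> backslash-escaped.
    const char* format = "(?1\\\\w+)(?2\\.)(?3\\\\s*)(?4\\\\$&)";
    std::stringstream final;
    final << "^.*";
    std::ostream_iterator<char, char> oi(final);
    boost::regex re;
    re.assign(expression);
    boost::regex_replace(oi, val.begin(), val.end(), re, format, boost::match_default | boost::format_all);
    // No std::ends here: it would append a '\0' character to the returned string.
    final << ".*";
    return final.str();
}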

It looks like everything works fine.
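For readers without Boost, roughly the same mapping can be sketched in Python with a single `re.sub` and a callback. This is an approximation of the function above, not a port verified against it: `glob_to_word_regex` is a hypothetical name, blanks are assumed to be space and tab, and the escaped character list mirrors the one in the C++ expression (which does not include `|`).

```python
import re

# Groups: 1 = '*', 2 = '?', 3 = blank, 4 = regex metacharacter to escape.
_TOKEN = re.compile(r"(\*)|(\?)|([ \t])|([.+^$\[\](){}\\])")


def glob_to_word_regex(glob):
    """Hypothetical helper mirroring the Boost-based mapping above."""
    def repl(m):
        if m.group(1):
            return r"\w+"   # '*' becomes one or more word characters
        if m.group(2):
            return "."      # '?' becomes any single character
        if m.group(3):
            return r"\s*"   # a blank becomes optional whitespace
        return "\\" + m.group(4)  # escape the metacharacter

    return "^.*" + _TOKEN.sub(repl, glob.strip()) + ".*"


print(glob_to_word_regex("foo.? bar*"))
# ^.*foo\..\s*bar\w+.*
```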

Evgeny Lazin
1

jPaq's RegExp.fromWildExp function does something similar to this. The following is taken from the example that is on the front page of the site:

// Find a first substring that starts with a capital "C" and ends with a
// lower case "n".
alert("Where in the world is Carmen Sandiego?".findPattern("C*n"));

// Finds two words (first name and last name), flips their order, and places
// a comma between them.
alert("Christopher West".replacePattern("(<*>) (<*>)", "p", "$2, $1"));

// Finds the first number that is at least three numbers long.
alert("2 to the 64th is 18446744073709551616.".findPattern("#{3,}", "ol"));
Clarence Fredericks
0

In R, there's the glob2rx function included in the base distribution:

http://stat.ethz.ch/R-manual/R-devel/library/utils/html/glob2rx.html

nassimhddd