Create regex from glob expression

Question

I am writing a program that parses text with a regular expression. The regular expression should be obtained from the user. I decided to use glob syntax for user input and to convert the glob string to a regular expression internally. For example:

"foo.? bar*"

should be converted to

"^.*foo\.\w\bar\w+.*"

Somehow I need to escape all regex-meaningful characters in the string, and then replace the glob * and ? characters with the appropriate regexp syntax. What is the most convenient way to do this?

The regex looks a bit strange. Like: "^.*foo" could be written as "foo". And I think the globbing star translates to the regex ".*?". Where did the space in the search go? And \bar matches words starting with "ar". — PEZ, Jan 15 '09 at 09:17

score 65 · Answer 1 · answered Oct 12 '09 at 17:07

There is no need for incomplete or unreliable hacks. Python includes a function for this:

>>> import fnmatch
>>> fnmatch.translate( '*.foo' )
'.*\\.foo$'
>>> fnmatch.translate( '[a-z]*.txt' )
'[a-z].*\\.txt$'
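As a minimal sketch of how the translated pattern might be used afterwards: the string returned by `fnmatch.translate` can be passed to `re.compile`. The exact text of the returned pattern differs between Python versions, so the example checks matching behaviour rather than the pattern string. The variable names `regex` and `matcher` are only illustrative.

```python
import fnmatch
import re

# Translate the glob; the exact regex text depends on the Python version.
regex = fnmatch.translate('*.foo')
matcher = re.compile(regex)

# The translated pattern is meant to match a whole name with re.match.
print(bool(matcher.match('bar.foo')))      # True
print(bool(matcher.match('bar.foo.bak')))  # False: the name must end in .foo
```

Because the pattern is anchored at the end, it suits whole-string matching; searching for a glob inside a longer text, as in the question, would need a different anchoring.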

If you are using python to create the regex, then you should use python to compare using the regex because sed does not understand the trailing '\\Z(?ms)' which is actually output by fnmatch.translate. – Paul Dec 10 '20 at 04:24

It's such a wonderful feeling when you need a function, google for a snippet, and discover that very function is in a built-in library. This is why I love python. – Matthew Leingang Apr 01 '21 at 19:27
Unfortunately, `translate` only helps for single filename matches. It's no help for paths, because `re.match(fnmatch.translate('*.txt'), 'a/foo.txt')` will match. Users expect `glob('*.txt')` not to match subdir paths. See https://github.com/jaraco/zipp/issues/98 where this limitation manifests. – Jason R. Coombs Jul 13 '23 at 02:03
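The path limitation mentioned in the last comment can be reproduced in a few lines. The helper `match_path_segments` below is hypothetical (it is not part of `fnmatch`); it is one possible workaround, assuming `/` as the separator and no support for `**`, that matches each path segment on its own.

```python
import fnmatch
import re

# '*' is translated to '.*', which also matches the '/' separator.
pattern = re.compile(fnmatch.translate('*.txt'))
print(bool(pattern.match('foo.txt')))    # True
print(bool(pattern.match('a/foo.txt')))  # True

def match_path_segments(glob_pattern, path):
    """Hypothetical helper: match each '/'-separated segment separately."""
    pattern_parts = glob_pattern.split('/')
    path_parts = path.split('/')
    # A '*' may not span a separator, so the segment counts must agree.
    if len(pattern_parts) != len(path_parts):
        return False
    return all(fnmatch.fnmatchcase(part, pat)
               for part, pat in zip(path_parts, pattern_parts))

print(match_path_segments('*.txt', 'a/foo.txt'))    # False
print(match_path_segments('*/*.txt', 'a/foo.txt'))  # True
```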

PEZ · Answer 2 · 2009-01-15T13:32:44.453

I'm not sure I fully understand the requirements. If I assume the users want to find text "entries" where their search matches then I think this brute way would work as a start.

First escape everything regex-meaningful. Then use non-regex replaces for replacing the (now escaped) glob characters and build the regular expression. Like so in Python:

regexp = re.escape(search_string).replace(r'\?', '.').replace(r'\*', '.*?')

For the search string in the question, this builds a regexp that looks like so (raw):

foo\..\ bar.*?

Used in a Python snippet:

import re

search = "foo.? bar*"
text1 = 'foo bar'
text2 = 'gazonk foo.c bar.m m.bar'

# Escape everything, then turn the escaped glob characters into regex syntax.
searcher = re.compile(re.escape(search).replace(r'\?', '.').replace(r'\*', '.*?'))

for text in (text1, text2):
  if searcher.search(text):
    print('Match: "%s"' % text)

Produces:

Match: "gazonk foo.c bar.m m.bar"

Note that if you examine the match object you can find out more about the match and use it for highlighting or whatever.

Of course, there might be more to it, but it should be a start.

That's right, but you also need to escape ()|\ [] and other meaningful characters in the search string — Evgeny Lazin, Jan 15 '09 at 10:59
Additionally, you can't blindly replace `?` with `.` because the `?` might be in a character set. Consider `a[?]txt`: with the suggested searcher, it would match `a.txt` even though it should only match `a?txt`. — Jason R. Coombs, Jul 13 '23 at 02:08
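The escape-then-replace approach does not treat `[...]` as a character set at all, which is the gap the last comment points at. One way to handle it is to walk the glob character by character and copy sets through instead of escaping them. The function `glob_to_regex` below is a hypothetical sketch under that assumption, not a replacement for a tested library routine; it keeps the unanchored, non-greedy style of the answer above.

```python
import re

def glob_to_regex(glob):
    """Hypothetical sketch: translate a glob to an unanchored regex,
    keeping [...] character sets intact."""
    out = []
    i, n = 0, len(glob)
    while i < n:
        ch = glob[i]
        if ch == '*':
            out.append('.*?')
        elif ch == '?':
            out.append('.')
        elif ch == '[':
            # Find the closing ']'; a leading '!' and a ']' placed first
            # belong to the set itself.
            j = i + 1
            if j < n and glob[j] == '!':
                j += 1
            if j < n and glob[j] == ']':
                j += 1
            end = glob.find(']', j)
            if end == -1:
                out.append(re.escape(ch))  # unmatched '[' is a literal
            else:
                body = glob[i + 1:end].replace('\\', '\\\\').replace('[', '\\[')
                if body[0] == '!':
                    body = '^' + body[1:]  # glob negation -> regex negation
                elif body[0] == '^':
                    body = '\\' + body     # a literal '^' inside the set
                out.append('[' + body + ']')
                i = end
        else:
            out.append(re.escape(ch))
        i += 1
    return ''.join(out)

print(bool(re.search(glob_to_regex('a[?]txt'), 'a?txt')))  # True
print(bool(re.search(glob_to_regex('a[?]txt'), 'a.txt')))  # False
```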

score 1 · Answer 3 · answered Jan 15 '09 at 07:44

Jakarta ORO has an implementation in Java.

orip

score 1 · Answer 4 · answered Jan 15 '09 at 08:16

I wrote my own function, using C++ and boost::regex:

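#include <iterator>
#include <sstream>
#include <string>
#include <boost/algorithm/string/trim.hpp>
#include <boost/regex.hpp>

std::string glob_to_regex(std::string val)
{
    boost::trim(val);
    // Groups: 1 = '*', 2 = '?', 3 = blank, 4 = regex metacharacter.
    const char* expression = "(\\*)|(\\?)|([[:blank:]])|(\\.|\\+|\\^|\\$|\\[|\\]|\\(|\\)|\\{|\\}|\\\\)";
    // '*' -> \w+, '?' -> ., blank -> \s*, metacharacter -> backslash-escaped.
    const char* format = "(?1\\\\w+)(?2\\.)(?3\\\\s*)(?4\\\\$&)";
    std::stringstream final;
    final << "^.*";
    std::ostream_iterator<char, char> oi(final);
    boost::regex re;
    re.assign(expression);
    boost::regex_replace(oi, val.begin(), val.end(), re, format, boost::match_default | boost::format_all);
    final << ".*";
    return final.str();
}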

It seems to work fine.
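For comparison, the same substitution rules can be sketched in Python with a single `re.sub` and a callback. This is only an illustration of the rules used in the C++ function above (`*` becomes `\w+`, `?` becomes `.`, a blank becomes `\s*`, metacharacters are escaped); the names `_TOKEN` and `glob_to_regex_py` are hypothetical.

```python
import re

# Same four alternatives as the C++ expression:
# 1 = '*', 2 = '?', 3 = blank (space or tab), 4 = metacharacter to escape.
_TOKEN = re.compile(r'(\*)|(\?)|([ \t])|([.+^$\[\](){}\\])')

def glob_to_regex_py(val):
    """Hypothetical Python counterpart of the C++ glob_to_regex."""
    def repl(m):
        if m.group(1):
            return r'\w+'
        if m.group(2):
            return '.'
        if m.group(3):
            return r'\s*'
        return '\\' + m.group(4)  # escape the metacharacter
    return '^.*' + _TOKEN.sub(repl, val.strip()) + '.*'

print(glob_to_regex_py('foo.? bar*'))  # ^.*foo\..\s*bar\w+.*
```

Note that with these rules `*` only matches word characters, so `foo.? bar*` matches `foo.c barn` but not `foo.c bar.m`.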

score 1 · Answer 5 · answered Mar 14 '11 at 16:48

jPaq's RegExp.fromWildExp function does something similar to this. The following is taken from the example that is on the front page of the site:

// Find a first substring that starts with a capital "C" and ends with a
// lower case "n".
alert("Where in the world is Carmen Sandiego?".findPattern("C*n"));

// Finds two words (first name and last name), flips their order, and places
// a comma between them.
alert("Christopher West".replacePattern("(<*>) (<*>)", "p", "$2, $1"));

// Finds the first number that is at least three numbers long.
alert("2 to the 64th is 18446744073709551616.".findPattern("#{3,}", "ol"));

score 0 · Answer 6 · answered Jun 10 '15 at 06:16

In R, there's the glob2rx function included in the base distribution:

http://stat.ethz.ch/R-manual/R-devel/library/utils/html/glob2rx.html

nassimhddd
