How to properly use a Unicode string in a Python regex

Question

I am getting an input regular expression from a user, which is saved as a unicode string. Do I have to turn the input string into a raw string before compiling it as a regex object, or is that unnecessary? Am I converting it to a raw string properly?

import re
input_regex_as_unicode = u"^(.){1,36}$"
string_to_check = "342342dedsfs"

# leave as unicode
compiled_regex = re.compile(input_regex_as_unicode)
match_string = re.match(compiled_regex, string_to_check)

# convert to raw
compiled_regex = re.compile(r'' + input_regex_as_unicode)
match_string = re.match(compiled_regex, string_to_check)

Possible duplicate of [Getting a raw string from a unicode string in python](http://stackoverflow.com/questions/14066883/getting-a-raw-string-from-a-unicode-string-in-python) — Ahsanul Haque, Sep 26 '16 at 09:19
@Ahsanul Haque, my question is more regular expression specific, whether the regex handles the unicode string properly when converting it into a regex object. — Ivan Bilan, Sep 26 '16 at 09:43

Stop harming Monica · Accepted Answer · 2016-09-26T10:33:58.800

The re module handles both unicode strings and normal (byte) strings properly, so you do not need to convert them to anything. You should, however, be consistent: use the same string type for the pattern and for the text you search.
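As a rough illustration of the consistency point, here is a minimal sketch assuming Python 3, where the two string types are `str` and `bytes` and mixing them in `re` raises `TypeError`. The variable names are made up for this example and do not come from the question.

```python
import re

# Assumption: Python 3, where str is Unicode text and bytes is raw bytes.
text_pattern = re.compile(r'\d+')     # str pattern, for str input
bytes_pattern = re.compile(br'\d+')   # bytes pattern, for bytes input

# Matching works when the pattern and the input have the same type.
text_match = text_pattern.match('123abc').group(0)
bytes_match = bytes_pattern.match(b'123abc').group(0)

# Mixing the two types is rejected by the re module.
try:
    text_pattern.match(b'123abc')
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```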

There is no such thing as a "raw string" object. Raw string notation is only a way of writing string literals in your source code, and it helps when the string contains backslashes. For instance, to match a newline character you could use '\\n', u'\\n', r'\n' or ur'\n' (the ur prefix exists only in Python 2).
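A small sketch of that equivalence, assuming Python 3 (so the `ur` form is left out): both literals produce the same two-character string, and `re` is what interprets that sequence as a newline. The names `escaped` and `raw` are illustrative only.

```python
import re

# Two ways of writing the same two-character string: a backslash followed by "n".
escaped = '\\n'
raw = r'\n'
same_pattern = (escaped == raw)
pattern_length = len(raw)

# The regex engine interprets that backslash-n sequence as "match a newline".
matches_newline = re.match(raw, '\n') is not None
```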

Your use of the raw string notation in your example does nothing since r'' and '' evaluate to the same string.
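To check this against the example from the question, here is a minimal sketch assuming Python 3, where every `str` is already Unicode and the `u` prefix is optional. The variable names are shortened from the question's and are otherwise arbitrary.

```python
import re

# Assumption: Python 3, so a plain str literal is already a Unicode string.
input_regex = "^(.){1,36}$"
string_to_check = "342342dedsfs"

# Prepending r'' concatenates an empty string, so nothing changes.
prefixed = r'' + input_regex
prefix_is_noop = (prefixed == input_regex)

# Compile the user-supplied pattern as-is and match with the compiled object.
compiled_regex = re.compile(input_regex)
match = compiled_regex.match(string_to_check)
matched_text = match.group(0) if match else None
```

By the time the pattern arrives as user input it is already a string object, so there is no literal left for raw notation to affect.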
