My regular expression with cyrillic symbols doesn't work

Question

Good something, everyone. I have some SQL code (the SQL itself is irrelevant here) in which I'd like to find a number, followed by "," (quotes included), followed by a string in Russian (my test string is "в"). Here's an example of a line in which I hope to find this:

insert into lemmas (id, word, lemma) values ("37","возбраняется","возбраняться");

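Here's my code in Python:

import re

def find_f_id(wordform):  # wordform is "в"
    file_SQL = open('sql_code.txt', 'r', encoding='UTF-8')
    SQLtext = file_SQL.read()
    # optional number, then the literal "," and the lowercased word form
    regux = '([0-9]+)?","' + wordform.lower()
    find_it = re.search(regux, SQLtext)
    found_it = find_it.group(1)  # raises AttributeError when find_it is None
    file_SQL.close()
    return found_it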

In the end, I want to get the particular number. The error I get with this code:

Traceback (most recent call last):
  File "C:\Users\Неро\my_study\homework_4_2016\holy_guacamole_SQL.py", line 109, in <module>
    main()
  File "C:\Users\Неро\my_study\homework_4_2016\holy_guacamole_SQL.py", line 106, in main
    imma_write_myself_a_SQL_file(val4, val3)
  File "C:\Users\Неро\my_study\homework_4_2016\holy_guacamole_SQL.py", line 85, in imma_write_myself_a_SQL_file
    f_id = find_f_id(wrdform)
  File "C:\Users\Неро\my_study\homework_4_2016\holy_guacamole_SQL.py", line 95, in find_f_id
    found_it = find_it.group(1)
AttributeError: 'NoneType' object has no attribute 'group'

Obviously, this means that re.search() found nothing. I've also tried searching with this regular expression in Notepad++, but that didn't work either (screenshot: my attempt to find this number before a word starting with "в").

(Sorry for the Russian Notepad++, I hope nobody minds.) As you can see in the picture, a word starting with "в" exists in the file. I've also tried several other regular expressions, such as `([0-9]+)?\",\"` and `([0-9]{1,3})","`.

I've also tried searching with re.findall(), but it just returned an empty list.

If you don't use Cyrillic characters, does it work? Your regex seems OK; it should match. Maybe this can help you: [http://stackoverflow.com/questions/1716609/how-to-match-cyrillic-characters-with-a-regular-expression](http://stackoverflow.com/questions/1716609/how-to-match-cyrillic-characters-with-a-regular-expression) – Victor Lia Fook, Dec 18 '16 at 23:48
Thanks, that helped me locate the problem. The problem is in the Cyrillic characters. But I don't get how to use either `\p{L}` or `[\p{IsCyrillic}]`. I'm not very good at regexes, so please explain. – tsuy01, Dec 19 '16 at 00:08
I've tried to replicate your problem in Notepad++ and was able to get a regex match: http://imgur.com/IM41hJr. It matched Cyrillic fine for me. – 3eyes, Dec 19 '16 at 00:11
Some older versions of Notepad++ reportedly had problems with regex matching, but if you're using a recent version it shouldn't be an issue. – 3eyes, Dec 19 '16 at 00:30

score 0 · Answer 1 · answered Dec 19 '16 at 02:28

Not sure this will help but it's at least good to share.

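You can try writing the Cyrillic characters in your pattern as Unicode escapes. For instance, в is \x{0432}. Note that \x{...} is PCRE syntax (the flavor regex101 uses by default); Python's re module does not accept it and spells the same code point \u0432.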

You can see the full match of возбраняется using [\x{0400}-\x{0450}]+ here: https://regex101.com/r/GRQBLK/1.
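In Python, the same idea might look like the sketch below. It assumes Python 3.3 or later, where the re module understands \uXXXX escapes in patterns. The sample line is the one from the question; find_id and CYRILLIC_WORD are illustrative names, not part of the asker's code. The function also checks for a failed match instead of calling .group() on None.

```python
import re

# Sample line taken from the question.
sql_text = 'insert into lemmas (id, word, lemma) values ("37","возбраняется","возбраняться");'

# Python spelling of the range above: \u0400-\u0450 instead of \x{0400}-\x{0450}.
CYRILLIC_WORD = re.compile(r'[\u0400-\u0450]+')


def find_id(text, wordform):
    """Return the number that precedes a word starting with wordform, or None."""
    # re.escape keeps any regex metacharacters in wordform literal.
    pattern = r'"([0-9]+)","' + re.escape(wordform.lower())
    match = re.search(pattern, text)
    return match.group(1) if match else None


print(CYRILLIC_WORD.search(sql_text).group(0))  # возбраняется
print(find_id(sql_text, 'в'))                   # 37
print(find_id(sql_text, 'я'))                   # None
```

If this works on a literal string but not on the file contents, the file itself (its encoding, or what it actually contains) is probably the next thing to check.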

Here is a tool to convert characters to their Unicode code points: https://www.branah.com/unicode-converter. Then wrap each code point with \x{...}.
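If you would rather not use an online converter, the built-in ord() gives the code point directly. The helper below is a hypothetical sketch (to_regex_escape is a made-up name) that produces escapes in the \uXXXX form Python's re module accepts, rather than the \x{...} form:

```python
import re


def to_regex_escape(text):
    """Spell every character of text as a \\uXXXX (or \\UXXXXXXXX) escape."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code <= 0xFFFF:
            parts.append('\\u{:04x}'.format(code))
        else:
            # Characters outside the Basic Multilingual Plane need 8 hex digits.
            parts.append('\\U{:08x}'.format(code))
    return ''.join(parts)


escaped = to_regex_escape('в')
print(escaped)                                    # \u0432
print(bool(re.search(escaped, 'возбраняется')))   # True
```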
