accented characters in a regex with Python

Question

This is my code

# -*- coding: utf-8 -*-
import json
import re

with open("/Users/paul/Desktop/file.json") as json_file:
    file = json.load(json_file)
print file["desc"]

key="capacità"
result = re.findall("((?:[\S,]+\s+){0,3})"+key+"\s+((?:[\S,]+\s*){0,3})", file["desc"], re.IGNORECASE)
print result

This is the content of the file

{
    "desc": "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"
}

My result is [], but what I want is result = "capacità".

Possible duplicate of [python and regular expression with unicode](http://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode) – MaxZoom, Oct 05 '15 at 22:54
@UsiUsi capacit\u00e0 and capacità are the same word! It is your editor that is not displaying the characters correctly. For example, I ran my function as print(find_context(' capacit\u00e0',0,3,s)) and it works, because the computer sees only 0s and 1s. – LetzerWille, Oct 05 '15 at 23:36
@UsiUsi Simply run this on your string: print(re.findall("capacità", s)). You will get ['capacità'], so the problem is with the regex you constructed. I may or may not find the error in it, but I will try. – LetzerWille, Oct 05 '15 at 23:45
string="Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" followed by print(re.findall("capacità", string)) gives []. Try it. – Usi Usi, Oct 05 '15 at 23:49
@UsiUsi Run this: print(re.findall("\w+\W+capacità\W+\w+\W+\w+", s, re.UNICODE)). It will find one word before and two after. I guess you now know what to do. – LetzerWille, Oct 05 '15 at 23:54
var="Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" followed by >>> print(re.findall("\w+\W+capacità\W+\w+\W+\w+", var, re.UNICODE)) doesn't work. – Usi Usi, Oct 05 '15 at 23:55
Possible duplicate of [Python match key with accented characters in a regex with Python](http://stackoverflow.com/questions/32959813/python-match-key-with-accented-characters-in-a-regex-with-python) – tripleee, Oct 06 '15 at 03:49

score 1 · Answer 1 · answered Oct 05 '15 at 22:54

You need to treat your string as a Unicode string, like this:

str = u"Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"

If you print str.encode('utf-8'), you get:

Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+

In the same way, you can make your regex pattern a Unicode string or a raw string with the u or r prefix, respectively.
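A minimal sketch of that advice applied to the question's data, assuming Python 2.7 (it also runs on Python 3.3 and later, where the u prefix is accepted). The names desc, pattern and matches are illustrative, and re.escape and re.UNICODE are additions that the answer itself does not mention.

```python
# -*- coding: utf-8 -*-
import re

# The text as a Unicode string; "\u00e0" is the escape for "à".
desc = u"Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"

# The key has to be Unicode as well. In Python 2 a plain "capacità"
# literal is a byte string, which is a different type from desc.
key = u"capacità"

# re.escape guards against regex metacharacters in the key (an added precaution).
pattern = re.escape(key)

# re.UNICODE makes \w, \W and case folding Unicode-aware in Python 2.
matches = re.findall(pattern, desc, re.IGNORECASE | re.UNICODE)
print(matches)
```

Once both the pattern and the searched text are Unicode, the escaped and the literal spelling of the accented letter compare equal, so the key is found.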

answered Oct 05 '15 at 22:54

Diogo Rocha

How can I convert "capacità" into capacit\u00e0? – Usi Usi Oct 05 '15 at 23:02
OK, I understand... but if I read "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" from a file, how can I put the u in front of it to say that it is Unicode? – Usi Usi Oct 05 '15 at 23:18
var="Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" followed by var=u(var) is not valid – Usi Usi Oct 05 '15 at 23:18
I have totally rewritten my answer – Usi Usi Oct 06 '15 at 00:56
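On the follow-up about text that comes from a file rather than from a literal: there is no u() call, but bytes can be decoded into Unicode. The sketch below assumes the file is UTF-8 encoded. The temporary file only stands in for the question's file.json, and the names raw, data, desc and decoded are illustrative.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

# Stand-in for the question's file.json, written to a temporary path.
raw = b'{"desc": "Frigocongelatore, capacit\\u00e0 di 215 litri, h 122 cm, classe A+"}'
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "wb") as handle:
    handle.write(raw)

# io.open decodes the bytes while reading, and json.load turns the
# \u00e0 escape inside the JSON text into the actual character.
with io.open(path, encoding="utf-8") as json_file:
    data = json.load(json_file)
os.remove(path)

desc = data["desc"]

# Bytes obtained some other way can be decoded explicitly.
decoded = b"capacit\xc3\xa0".decode("utf-8")

print(decoded in desc)
```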

score 0 · Answer 2 · edited Oct 06 '15 at 07:58

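You can use the function below to find the word and its context; it behaves the same whether the accented character is typed literally or written as an escape.

Your editor's default encoding should be UTF-8. sys.getdefaultencoding() shows the default encoding that Python itself uses.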

import re

def find_context(word_, n_before, n_after, string_):
    # Finds the word plus n_before words before it and n_after words after it.
    # Raises AttributeError if there is no match (re.search returns None).
    b = r'\w+\W+' * n_before
    a = r'\W+\w+' * n_after
    pattern = '(' + b + word_ + a + ')'
    return re.search(pattern, string_).group(1)

s = "Frigocongelatore,  capacità di 215 litri, h 122 cm, classe A+"

# find 0 words before and 3 after the word capacità
print(find_context('capacità',0,3,s) )

capacità di 215 litri

print(find_context(' capacit\u00e0',0,3,s) )

 capacità di 215 litri
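A variant of the same idea, sketched under a few assumptions: it takes Unicode input, uses the {0,n} quantifiers from the question so that fewer context words are tolerated, and returns None instead of raising when the word is absent. The function name find_context_unicode and its return shape are hypothetical and not part of this answer.

```python
# -*- coding: utf-8 -*-
import re

def find_context_unicode(word, n_before, n_after, text):
    """Return (before, after) context of the first match of word, or None."""
    pattern = (
        r"((?:\w+\W+){0,%d})" % n_before    # up to n_before words before
        + re.escape(word)                   # the key, taken literally
        + r"((?:\W+\w+){0,%d})" % n_after   # up to n_after words after
    )
    match = re.search(pattern, text, re.IGNORECASE | re.UNICODE)
    return match.groups() if match else None

text = u"Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"
context = find_context_unicode(u"capacità", 3, 3, text)
print(context)
```

With the question's data only one word precedes the key, so the first group holds that single word and the second group holds the three words that follow.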

edited Oct 06 '15 at 07:58

nhahtdh

answered Oct 05 '15 at 22:55

LetzerWille

It works, but my problem is in the encoding... I have capacit\u00e0, not capacità – Usi Usi Oct 05 '15 at 23:14
@Usi Usi Do you mean that is how it is displayed on your computer? You have to configure your editor environment. Run print(sys.getdefaultencoding()); it should display your encoding. Very likely it is not UTF-8. – LetzerWille Oct 05 '15 at 23:22
I have totally rewritten my answer – Usi Usi Oct 06 '15 at 00:56
