
This is my code

# -*- coding: utf-8 -*-
import json
import re

with open("/Users/paul/Desktop/file.json") as json_file:
    file = json.load(json_file)
print file["desc"]

key="capacità"
result = re.findall("((?:[\S,]+\s+){0,3})"+key+"\s+((?:[\S,]+\s*){0,3})", file["desc"], re.IGNORECASE)
print result

This is the content of the file

{
    "desc": "Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"
}

My result is [], but what I want is result = "capacità".

Usi Usi
  • Which version of Python are you using? – Blckknght Oct 05 '15 at 22:47
  • Possible duplicate of [python and regular expression with unicode](http://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode) – MaxZoom Oct 05 '15 at 22:54
  • I'm using Python 2.7.1 – Usi Usi Oct 05 '15 at 22:55
  • @UsiUsi capacit\u00e0 and capacità are the same word! It is your editor that is not displaying the characters correctly. For example, I ran my function as print(find_context(' capacit\u00e0',0,3,s)) and it works, because the computer sees only 0s and 1s. – LetzerWille Oct 05 '15 at 23:36
  • Ok but I can't catch capacità with my regex... why? – Usi Usi Oct 05 '15 at 23:37
  • @UsiUsi will look at it now – LetzerWille Oct 05 '15 at 23:40
  • @UsiUsi simply run this on your string: print(re.findall("capacità", s)) and you will get ['capacità'], so the problem is with the regex you constructed. I may or may not find the error in it, but I will try. – LetzerWille Oct 05 '15 at 23:45
  • string="Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" print(re.findall("capacità", string)) Result [] try it – Usi Usi Oct 05 '15 at 23:49
  • @UsiUsi run this: print(re.findall("\w+\W+capacità\W+\w+\W+\w+", s, re.UNICODE)) It will find one word before and two after. I guess you now know what to do. – LetzerWille Oct 05 '15 at 23:54
  • var="Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+" >>> print(re.findall("\w+\W+capacità\W+\w+\W+\w+", var, re.UNICODE)) Doesn't work – Usi Usi Oct 05 '15 at 23:55
  • Possible duplicate of [Python match key with accented characters in a regex with Python](http://stackoverflow.com/questions/32959813/python-match-key-with-accented-characters-in-a-regex-with-python) – tripleee Oct 06 '15 at 03:49

2 Answers


You need to treat your string as a Unicode string, like this:

text = u"Frigocongelatore, capacit\u00e0 di 215 litri, h 122 cm, classe A+"

And as you can see, if you print text.encode('utf-8') you'll get:

Frigocongelatore, capacità di 215 litri, h 122 cm, classe A+

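In the same way, you can make your regex pattern a Unicode string with the u prefix, or a raw string with the r prefix. In Python 2 the key in particular needs the u prefix (key = u"capacità"), because json.load returns Unicode strings and a byte-string pattern will not match the accented character.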
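As a rough sketch of how this could be applied to the regex from the question: the `raw` string below is only a stand-in for the contents of file.json, and `re.escape` and `re.UNICODE` are additions that are not in the original code. The key is written with the \u00e0 escape so the snippet stays ASCII-only and should behave the same on Python 2.7 and Python 3.

```python
import json
import re

# Stand-in for the contents of file.json from the question (assumption).
raw = '{"desc": "Frigocongelatore, capacit\\u00e0 di 215 litri, h 122 cm, classe A+"}'
data = json.loads(raw)  # "desc" comes back as a Unicode string

# Unicode key; the \u00e0 escape is the accented "a" of the question's key.
key = u"capacit\u00e0"

# Same pattern as in the question, with the key escaped so it is taken literally.
pattern = r"((?:[\S,]+\s+){0,3})" + re.escape(key) + r"\s+((?:[\S,]+\s*){0,3})"

# re.UNICODE makes \s and \S Unicode-aware on Python 2 (already the default on 3).
result = re.findall(pattern, data["desc"], re.IGNORECASE | re.UNICODE)
print(result)  # a list with one (before, after) tuple per occurrence of the key
```

With the pattern and the searched text both Unicode, the list is no longer empty; each tuple holds up to three words before and up to three words after the key.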

Diogo Rocha

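You can use this function to find a word together with the words around it; it matches whether the accented character is typed directly or written as an escape.

Your editor should save the source file as UTF-8. You can check the default encoding that Python itself uses with sys.getdefaultencoding().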

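import re

def find_context(word_, n_before, n_after, string_):
    # Finds the word plus n_before words before it and n_after words after it.
    before = r'\w+\W+' * n_before
    after = r'\W+\w+' * n_after
    # re.escape keeps the word literal; re.UNICODE lets \w match accented letters.
    pattern = '(' + before + re.escape(word_) + after + ')'
    match = re.search(pattern, string_, re.UNICODE)
    # Return None instead of raising when the word is not found.
    return match.group(1) if match else None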

s = u"Frigocongelatore,  capacità di 215 litri, h 122 cm, classe A+"

# find 0 words before and 3 words after the word capacità
print(find_context(u'capacità', 0, 3, s))

capacità di 215 litri

# the escaped spelling is the same character, so it matches too
print(find_context(u' capacit\u00e0', 0, 3, s))

 capacità di 215 litri
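To check that capacit\u00e0 and capacità really are the same text once decoded, a small comparison like the one below may help. The variable names are made up for illustration, and the snippet is only a sketch: it compares the string that json produces with a Unicode literal, then shows that the UTF-8 byte form has a different length. That byte form is what a plain (non-u) literal holds on Python 2.

```python
import json
import sys

# What json.load / json.loads hands back for the escaped JSON text.
decoded = json.loads('"capacit\\u00e0"')

# The same word written as a Unicode literal.
literal = u"capacit\u00e0"

same_text = (decoded == literal)
utf8_bytes = literal.encode("utf-8")

print(same_text)                      # the two spellings decode to equal strings
print(len(literal), len(utf8_bytes))  # characters vs. UTF-8 bytes
print(sys.getdefaultencoding())       # Python's own default, not the editor's
```

The accented letter is one character but two UTF-8 bytes, which is why a byte-string pattern and a Unicode subject do not line up.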
nhahtdh
LetzerWille
  • It works, but my problem is in the encoding... I have capacit\u00e0, not capacità – Usi Usi Oct 05 '15 at 23:14
  • @Usi Usi do you mean that is how it is displayed on your computer? You have to configure your editor environment. Run print(sys.getdefaultencoding()); it should display your encoding. Very likely it is not UTF-8. – LetzerWille Oct 05 '15 at 23:22
  • I have totally rewritten my answer – Usi Usi Oct 06 '15 at 00:56