I want to extract some text that contains non-ASCII characters. The problem is that the match stops at the first non-ASCII character, as if it were a delimiter. I tried this:
import re

regex_fmla = r'(?:title=[\'"])([:/.A-z?<_&\s=>0-9;-]+)'
c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2 = '<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'
c_list = [c1, c2]
for c in c_list:
    print(re.findall(regex_fmla, c))
The result is:
['Climate data: C']
['Climate data: Cameroon']
Notice that the first result is wrong: the match breaks off at ô. It should be:
['Climate data: Côte d\'Ivoire']
I searched Stack Overflow and found an answer that suggests using the re.UNICODE flag, but it returns the same wrong result.
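A minimal sketch of that re.UNICODE attempt, for reference. It assumes Python 3 and that the flag is passed as the third argument to re.findall; the exact call is not shown in the question, so that part is a guess. The names without_flag and with_flag are only illustrative. The flag affects shorthand classes such as \w and \s, while an explicit range like A-z still covers only the code points between A and z, which may be why nothing changes here.

```python
import re

# Same pattern as in the question, written as a raw string.
regex_fmla = r'(?:title=[\'"])([:/.A-z?<_&\s=>0-9;-]+)'
c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'

# Assumption: the flag is passed as the third (flags) argument of re.findall.
without_flag = re.findall(regex_fmla, c1)
with_flag = re.findall(regex_fmla, c1, re.UNICODE)

# Print both results so they can be compared side by side.
print(without_flag)
print(with_flag)
```

Under these assumptions both calls return the same list, so the match still ends just before the ô.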
How can I fix this?