
I want to extract some text that contains non-ASCII characters. The problem is that my pattern treats non-ASCII characters as delimiters! I tried this:

import re

regex_fmla = '(?:title=[\'"])([:/.A-z?<_&\s=>0-9;-]+)'
c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2 = '<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'
c_list = [c1, c2]
for c in c_list:
    print re.findall(regex_fmla, c)

The result is:

['Climate data: C']
['Climate data: Cameroon']

Notice that the first country is not correct: the match breaks at ô. It should be:

['Climate data: Côte d\'Ivoire']

I searched Stack Overflow and found an answer that suggests using the re.UNICODE flag, but it returns the same wrong answer!

How can I fix this?

Mohammad ElNesr
  • The character `ô` does not appear in your regex so, yes, it's a delimiter. `ô` is not an ASCII character, and it's not covered by your `A-z`. (Which, incidentally, you may want to know, also does *not* mean "all uppercase and lowercase letters".) – Jongware Dec 25 '16 at 10:49
  • Why don't you use BeautifulSoup to parse html? It's more lightweight than re for parsing html – Miguel Dec 25 '16 at 10:50
  • Instead of that complicated regex (where after each new failure you need to squeeze in yet another character), you can search up to that closing `"` *only*: `"[^"]+"`. (Just the relevant part.) – Jongware Dec 25 '16 at 10:51
  • @RadLexus, Thanks, I know that A-z does not include special chars, and my question is how to include all non-English letters in my RegEx? – Mohammad ElNesr Dec 25 '16 at 10:52
  • @Miguel, Thanks for your comment, but I never tried it before. Is it easier or faster than RegEx? – Mohammad ElNesr Dec 25 '16 at 10:54
  • Yep, it is... There's already an answer to solve your problem below. – Miguel Dec 25 '16 at 10:55
  • There is no regular regex code for that, you'd need to add each one separately. Does Python's regex support extended Unicode queries such as `\p{L}`? – Jongware Dec 25 '16 at 10:56
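Regarding Jongware's last question about `\p{L}`: the built-in re module does not support it, but the third-party regex module (installed with pip install regex) does. A minimal sketch, assuming that package is acceptable:

    import regex  # third-party module (pip install regex), not the built-in re

    c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'

    # \p{L} matches any Unicode letter, so accented characters such as ô
    # are no longer treated as delimiters
    print(regex.findall(r'title="([\p{L}\p{N}\s:\'./-]+)"', c1))
    # -> ["Climate data: Côte d'Ivoire"]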

3 Answers


I would suggest using BeautifulSoup for parsing html:

from bs4 import BeautifulSoup as bs

c1='<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2='<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'


for c in [c1, c2]:
    soup = bs(c, 'html.parser')
    print(soup.find('a')['title'])

For more links (<a ...>), use the .findAll() method:

# bightml here stands for a larger HTML string containing several <a> tags
for c in [bightml]:
    soup = bs(c, 'html.parser')
    for a in soup.findAll('a'):
        print(a['title'])

If you need anything that has a title attribute:

for a in soup.findAll(title=True):
    print(a['title'])
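
As a usage note: in current versions of BeautifulSoup 4 the method is spelled find_all (findAll is kept as a backwards-compatible alias). A minimal self-contained sketch combining both ideas above; the html string here is made up purely for illustration:

    from bs4 import BeautifulSoup as bs

    # Made-up snippet with several <a> tags, for illustration only
    html = ('<a title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
            '<a title="Climate data: Cameroon">Cameroon</a>')

    soup = bs(html, 'html.parser')
    for a in soup.find_all(title=True):  # same as soup.findAll(title=True)
        print(a['title'])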
Yevhen Kuzmovych

I would also suggest BeautifulSoup, but since it seems you want to know how to include those special characters, you can change your regular expression to this:

ex = 'title="(.+?)"'

and then:

import re

c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'

for x in re.findall(ex, c1):
    print x

Output:

Climate data: Côte d'Ivoire
Carles Mitjans
  • It returned the following: `["Climate data: C\xc3\xb4te d'Ivoire"]` – Mohammad ElNesr Dec 25 '16 at 11:03
  • That is the string representation of `Climate data: Côte d'Ivoire`. The same way a newline will be stored as `'\n'`, `'ô'` will be stored as `'\xc3\xb4'`. If you print it, you will see it prints correctly. – Carles Mitjans Dec 25 '16 at 11:08
  • @MohammadElNesr another way to see it is: if you do `"ô" in "Climate data: C\xc3\xb4te d'Ivoire"` it will return `True` – Carles Mitjans Dec 25 '16 at 11:09
  • @CarlesMitjans: isn't that a Python v.2/3 problem? If I recall correctly, no longer an issue in v.3? – Jongware Dec 25 '16 at 11:15
  • @RadLexus I don't think we can call it a *problem*. It is how Python 2.x is made. In Python 3 the `str` object is what used to be `unicode`, while Python 2.7's `str` corresponds to `bytes`. See [this](http://stackoverflow.com/a/10814498/2148023) – Carles Mitjans Dec 25 '16 at 11:53
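
To illustrate the point about the byte representation: `'\xc3\xb4'` is simply the UTF-8 encoding of `ô`. A small sketch (Python 3 syntax, writing the bytes explicitly):

    # 'ô' encoded as UTF-8 is the two bytes 0xc3 0xb4
    print('ô'.encode('utf-8'))                                    # b'\xc3\xb4'
    print(b"Climate data: C\xc3\xb4te d'Ivoire".decode('utf-8'))  # Climate data: Côte d'Ivoire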

I suggest using BeautifulSoup, but if you would prefer to stick with re:

import re

regex_fmla = '(?:title=[\'"])([\w :\':/.]+)'

c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2 = '<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'
c_list = [c1, c2]

for c in c_list:
    print(re.findall(regex_fmla, c, flags=re.UNICODE))

I believe the reason re.UNICODE did not work is that you explicitly defined the alphabet in your expression as [A-z0-9]. If we change that to simply \w, then the flag works correctly.
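
To make the contrast concrete, here is a tiny sketch (Python 3 shown, where str patterns are Unicode-aware by default; under Python 2 the same behaviour needs unicode strings plus re.UNICODE):

    import re

    word = "Côte"
    print(re.findall(r'[A-z]+', word))  # ['C', 'te'] -- ô falls outside the explicit ASCII range
    print(re.findall(r'\w+', word))     # ['Côte']    -- \w covers Unicode letters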

Ronikos
  • Unfortunately, it also breaks after the o, and returned: `['Climate data: C\xc3']` – Mohammad ElNesr Dec 25 '16 at 11:04
  • That is interesting, it works for me. Perhaps you have some sort of encoding problem - see http://stackoverflow.com/questions/2783079/how-do-i-convert-a-unicode-to-a-string-at-the-python-level – Ronikos Dec 25 '16 at 11:08
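
One common cause in Python 2, along the lines of that link, is matching against a UTF-8 byte string rather than a unicode string (the \xc3 in the output suggests exactly that). A possible fix is to decode the input first; this is only a sketch, assuming Python 2 and UTF-8-encoded source data:

    # -*- coding: utf-8 -*-
    import re

    regex_fmla = '(?:title=[\'"])([\w :\':/.]+)'
    c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'

    # Decode the UTF-8 byte string to unicode so that \w (with re.UNICODE) sees
    # real characters such as ô instead of two raw bytes.
    for m in re.findall(regex_fmla, c1.decode('utf-8'), flags=re.UNICODE):
        print m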