
I want to extract some text that contains non-ASCII characters. The problem is that my pattern treats non-ASCII characters as delimiters! I tried this:

import re

regex_fmla = '(?:title=[\'"])([:/.A-z?<_&\s=>0-9;-]+)'
c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2 = '<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'
c_list = [c1, c2]
for c in c_list:
    print re.findall(regex_fmla, c)

The result is:

['Climate data: C']
['Climate data: Cameroon']

Notice that the first country is not correct: the match breaks at ô. It should be:

['Climate data: Côte d\'Ivoire']

I searched Stack Overflow and found an answer that suggests using the re.UNICODE flag, but it returns the same wrong answer!

How can I fix this?

Mohammad ElNesr
  • The character `ô` does not appear in your regex so, yes, it's a delimiter. `ô` is not an ASCII character, and it's not covered by your `A-z`. (Which, incidentally, you may want to know, also does *not* mean "all uppercase and lowercase letters".) – Jongware Dec 25 '16 at 10:49
  • Why don't you use BeautifulSoup to parse html? It's more lightweight than re for parsing html – Miguel Dec 25 '16 at 10:50
  • Instead of that complicated regex (where after each new failure you need to squeeze in yet another character), you can search up to that closing `"` *only*: `"[^"]+"`. (Just the relevant part.) – Jongware Dec 25 '16 at 10:51
  • @RadLexus, Thanks, I know that A-z does not include special chars, and my question is how to include all non-English letters in my RegEx? – Mohammad ElNesr Dec 25 '16 at 10:52
  • @Miguel, Thanks for your comment, but I never tried it before. Is it easier or faster than RegEx? – Mohammad ElNesr Dec 25 '16 at 10:54
  • Yep, it is... There's already an answer to solve your problem below. – Miguel Dec 25 '16 at 10:55
  • There is no regular regex code for that, you'd need to add each one separately. Does Python's regex support extended Unicode queries such as `\p{L}`? – Jongware Dec 25 '16 at 10:56
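Regarding Jongware's last question about `\p{L}`: the built-in re module does not support it, but the third-party regex module (installed with pip install regex) does. A minimal sketch, assuming that package is acceptable:

    import regex  # third-party module (pip install regex), not the built-in re

    c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'

    # \p{L} matches any Unicode letter, so accented characters such as ô
    # are no longer treated as delimiters
    print(regex.findall(r'title="([\p{L}\p{N}\s:\'./-]+)"', c1))
    # -> ["Climate data: Côte d'Ivoire"]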

3 Answers


I would suggest using BeautifulSoup for parsing html:

from bs4 import BeautifulSoup as bs

c1='<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2='<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'


for c in [c1, c2]:
    soup = bs(c, 'html.parser')
    print(soup.find('a')['title'])

For more links (<a ...>), use the .findAll() method:

# bightml here stands for a larger HTML string containing several <a> tags
for c in [bightml]:
    soup = bs(c, 'html.parser')
    for a in soup.findAll('a'):
        print(a['title'])

If you need anything that has a title attribute:

for a in soup.findAll(title=True):
    print(a['title'])
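
As a usage note: in current versions of BeautifulSoup 4 the method is spelled find_all (findAll is kept as a backwards-compatible alias). A minimal self-contained sketch combining both ideas above; the html string here is made up purely for illustration:

    from bs4 import BeautifulSoup as bs

    # Made-up snippet with several <a> tags, for illustration only
    html = ('<a title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
            '<a title="Climate data: Cameroon">Cameroon</a>')

    soup = bs(html, 'html.parser')
    for a in soup.find_all(title=True):  # same as soup.findAll(title=True)
        print(a['title'])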
Yevhen Kuzmovych

I would also suggest BeautifulSoup, but since it seems you want to know how to include those special characters, you can change your regular expression to this:

ex = 'title="(.+?)"'

and then:

import re

c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'

for x in re.findall(ex, c1):
    print x

Output:

Climate data: Côte d'Ivoire
Carles Mitjans
  • It returned the following: `["Climate data: C\xc3\xb4te d'Ivoire"]` – Mohammad ElNesr Dec 25 '16 at 11:03
  • That is the string representation of `Climate data: Côte d'Ivoire`. The same way a newline will be stored as `'\n'`, `'ô'` will be stored as `'\xc3\xb4'`. If you print it, you will see it prints correctly. – Carles Mitjans Dec 25 '16 at 11:08
  • @MohammadElNesr another way to see it is: if you do `"ô" in "Climate data: C\xc3\xb4te d'Ivoire"` it will return `True` – Carles Mitjans Dec 25 '16 at 11:09
  • @CarlesMitjans: isn't that a Python v.2/3 problem? If I recall correctly, no longer an issue in v.3? – Jongware Dec 25 '16 at 11:15
  • @RadLexus I don't think we can call it a *problem*. It is how Python 2.x is made. In Python 3 the `str` object is what used to be `unicode`, while Python 2.7's `str` corresponds to `bytes`. See [this](http://stackoverflow.com/a/10814498/2148023) – Carles Mitjans Dec 25 '16 at 11:53
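
To illustrate the point about the byte representation: `'\xc3\xb4'` is simply the UTF-8 encoding of `ô`. A small sketch (Python 3 syntax, writing the bytes explicitly):

    # 'ô' encoded as UTF-8 is the two bytes 0xc3 0xb4
    print('ô'.encode('utf-8'))                                    # b'\xc3\xb4'
    print(b"Climate data: C\xc3\xb4te d'Ivoire".decode('utf-8'))  # Climate data: Côte d'Ivoire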

I suggest using BeautifulSoup, but if you would prefer to stick with re:

import re

regex_fmla = '(?:title=[\'"])([\w :\':/.]+)'

c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'
c2 = '<a href="/climate/cameroon.html" title="Climate data: Cameroon">Cameroon</a>'
c_list = [c1, c2]

for c in c_list:
    print(re.findall(regex_fmla, c, flags=re.UNICODE))

I believe the reason re.UNICODE did not work is that you explicitly defined the alphabet in your expression as [A-z0-9]. If we change that to simply \w, then the flag works correctly.
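
To make the contrast concrete, here is a tiny sketch (Python 3 shown, where str patterns are Unicode-aware by default; under Python 2 the same behaviour needs unicode strings plus re.UNICODE):

    import re

    word = "Côte"
    print(re.findall(r'[A-z]+', word))  # ['C', 'te'] -- ô falls outside the explicit ASCII range
    print(re.findall(r'\w+', word))     # ['Côte']    -- \w covers Unicode letters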

Ronikos
  • Unfortunately, it also breaks after the o, and returned: `['Climate data: C\xc3']` – Mohammad ElNesr Dec 25 '16 at 11:04
  • That is interesting, it works for me. Perhaps you have some sort of encoding problem - see http://stackoverflow.com/questions/2783079/how-do-i-convert-a-unicode-to-a-string-at-the-python-level – Ronikos Dec 25 '16 at 11:08
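
One common cause in Python 2, along the lines of that link, is matching against a UTF-8 byte string rather than a unicode string (the \xc3 in the output suggests exactly that). A possible fix is to decode the input first; this is only a sketch, assuming Python 2 and UTF-8-encoded source data:

    # -*- coding: utf-8 -*-
    import re

    regex_fmla = '(?:title=[\'"])([\w :\':/.]+)'
    c1 = '<a href="/climate/cote-d-ivoire.html" title="Climate data: Côte d\'Ivoire">Côte d\'Ivoire</a>'

    # Decode the UTF-8 byte string to unicode so that \w (with re.UNICODE) sees
    # real characters such as ô instead of two raw bytes.
    for m in re.findall(regex_fmla, c1.decode('utf-8'), flags=re.UNICODE):
        print m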