Separating letters and non-alphabetic characters in a non-English text in Python

Question

I am scraping a Portuguese website in Python 2.7, and I want to separate the words from the numeric code that appears in parentheses. Each text looks like:

text = 'Obras de revisão e recuperação (45453000-7)'

I tried the following code:

#-*- coding: utf-8 -*-
import re
text = u'Obras de revisão e recuperação (45453000-7)'
re.sub(r'\([0-9-]+\)', u'', text).encode("utf8")

The output is:

'Obras de revis\xc3\xa3o e recupera\xc3\xa7\xc3\xa3o '

I want to remove the parentheses as well and get output like:

name = 'Obras de revisão e recuperação'
code = '45453000-7'

Try declaring the `text` var with the `u` prefix, and then use [`re.sub(r'\([0-9-]+\)', u'', text).encode("utf8")`](http://rextester.com/HEFHL85854). The pattern may also be `r"\([0-9]+(?:-[0-9]+)?\)"`. — Wiktor Stribiżew, Jun 19 '17 at 11:38
I got UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14 — mk_sch, Jun 19 '17 at 11:45
If you're positive that the text will always be of the form "name (code)" then I wouldn't even use regex. Just split the text by the left parenthesis and then remove the right one from the code variable. `name, code = text.split(" ("); code = code.replace(")", "")` — tblznbits, Jun 19 '17 at 11:47
thanx, but printing the name ends up with the same output as before: 'Obras de revis\xc3\xa3o e recupera\xc3\xa7\xc3\xa3o' — mk_sch, Jun 19 '17 at 11:51
I added `import sys; reload(sys); sys.setdefaultencoding('UTF8')` to my code and encoded it as you said. — mk_sch, Jun 19 '17 at 11:57
Good, but did you use `re.sub(...)` **.encode("utf8")**? Note that you may just add `#-*- coding: utf-8 -*-` at the top of the file so that it is treated as a UTF-8 file. `sys.setdefaultencoding('UTF8')` is [considered a bad practice](https://stackoverflow.com/q/3828723/3832970). — Wiktor Stribiżew, Jun 19 '17 at 12:01
Yeah, I copied your code and executed it, but the result is the same. Your code in the link works very well, but when I copy it to my notebook, it outputs a different result. I run Python 2.7, maybe that is the reason. — mk_sch, Jun 19 '17 at 12:04
I also posted a link to Python 2.7. Please post the code you are trying. Edit the question. — Wiktor Stribiżew, Jun 19 '17 at 12:11
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/147066/discussion-between-mk-sch-and-wiktor-stribizew). — mk_sch, Jun 19 '17 at 12:14
@mk_sch Ok, you did not use `print()`. You should have. But anyway, I see you just want to split the string. Try http://rextester.com/XPVHR17179 — Wiktor Stribiżew, Jun 19 '17 at 12:20
@ Wiktor: thanx for your answer, both of them work with print, but when it comes to returning in a function, the problem is still there. — mk_sch, Jun 20 '17 at 07:44
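
The suggestions in the comments above (a pattern with an optional hyphenated part, or splitting at the parenthesis) could be combined roughly as follows. This is only a sketch: `CODE_PATTERN` and `split_name_and_code` are hypothetical names, and it assumes every input has the form "name (code)". It uses `unicode_literals` so that it should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# Assumption: the code is digits, optionally followed by "-digits",
# wrapped in parentheses at the end of the text.
CODE_PATTERN = re.compile(r'^(.*?)\s*\(([0-9]+(?:-[0-9]+)?)\)\s*$')


def split_name_and_code(text):
    """Return (name, code); code is None when no trailing "(code)" is found."""
    match = CODE_PATTERN.match(text)
    if match is None:
        return text.strip(), None
    return match.group(1), match.group(2)


name, code = split_name_and_code('Obras de revisão e recuperação (45453000-7)')
print(name)  # the words, without the trailing space or parentheses
print(code)  # the code, without parentheses
```

Both values stay unicode strings here; encoding to UTF-8 would only be needed at the point where they are written to a file or socket.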

score 2 · Accepted Answer · answered Jun 19 '17 at 16:47

It should work like this:

file: /tmp/foo.py

#-*- coding: utf-8 -*-
import re
text = u'Obras de revisão e recuperação (45453000-7)'
print re.sub(r'\([0-9-]+\)', u'', text)

Note that there is no .encode('utf-8') call.

Now, in a Python console:

>>> import re
>>> text = u'Obras de revisão e recuperação (45453000-7)'
>>> re.sub(r'\([0-9-]+\)', u'', text)
u'Obras de revis\xe3o e recupera\xe7\xe3o '
>>> print re.sub(r'\([0-9-]+\)', u'', text)
Obras de revisão e recuperação

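As you can see, print re.sub(...) displays the readable text, while the bare expression in the console shows the repr() of the unicode string, with non-ASCII characters escaped. The string itself is the same in both cases.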

I suspect that is what you are struggling with.

For reference: Difference between __str__ and __repr__ in Python

Arount

Nice explanation ! – Till Jun 20 '17 at 07:09
It works with print, but when I put your code within a function with return, that outputs the same result: u'Obras de revis\xe3o e recupera\xe7\xe3o ' – mk_sch Jun 20 '17 at 07:36
So `unicode.__str__()` returns the readable content of the instance. When you want to display or store the content as a human-readable string, this is what you have to call. The `print` statement implicitly calls `__str__()` on the printed instance; you can also call it directly with `str(foo)`. In a Python console, if you don't use `print` but just type `>>> foo`, it calls `__repr__()` instead of `__str__()`, even if the value comes from a function (`def foo(): return u'lol'; print foo()` does exactly the same as `print u'lol'`). – Arount Jun 20 '17 at 07:53
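
The point about functions can be checked with a short sketch. `strip_code` is a hypothetical helper, not something from the thread, and the pattern is the one from the question with a leading `\s*` added to drop the trailing space. The returned value is compared against the expected text directly, which shows that the escapes only appear when the value is displayed through `repr()`.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re


def strip_code(text):
    # Remove the "(digits-digits)" part together with the space before it.
    return re.sub(r'\s*\([0-9-]+\)', '', text)


result = strip_code('Obras de revisão e recuperação (45453000-7)')

# The function returns the real text, accents included.
print(result == 'Obras de revisão e recuperação')

print(result)        # readable form, what print shows
print(repr(result))  # debugging form, what a bare console echo shows;
                     # on Python 2 this is where escapes such as \xe3 appear
```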
