
I am scraping a Portuguese website in Python 2.7, and I want to separate the name (Latin-script words) from the numeric code that appears in parentheses. Each text looks like:

text = 'Obras de revisão e recuperação (45453000-7)'

I tried the following code:

#-*- coding: utf-8 -*-
import re
text = u'Obras de revisão e recuperação (45453000-7)'
re.sub(r'\([0-9-]+\)', u'', text).encode("utf8")

The output is:

'Obras de revis\xc3\xa3o e recupera\xc3\xa7\xc3\xa3o '

I also want to remove the parentheses and get output like:

name = 'Obras de revisão e recuperação'
code = '45453000-7'
Cœur
mk_sch
  • Try declaring the `text` variable with a `u` prefix, and then use [`re.sub(r'\([0-9-]+\)', u'', text).encode("utf8")`](http://rextester.com/HEFHL85854). The pattern may also be `r"\([0-9]+(?:-[0-9]+)?\)"`. – Wiktor Stribiżew Jun 19 '17 at 11:38
  • I got UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14 – mk_sch Jun 19 '17 at 11:45
  • Your default encoding must be set to UTF8. – Wiktor Stribiżew Jun 19 '17 at 11:46
  • If you're positive that the text will always be of the form "name (code)" then I wouldn't even use regex. Just split the text by the left parenthesis and then remove the right one from the code variable. `name, code = text.split(" ("); code = code.replace(")", "")` – tblznbits Jun 19 '17 at 11:47
  • thanx, but printing the name ends up with the same output as before: Obras de revis\xc3\xa3o e recupera\xc3\xa7\xc3\xa3o' – mk_sch Jun 19 '17 at 11:51
  • @mk_sch Did you encode it with UTF8 as in my example? – Wiktor Stribiżew Jun 19 '17 at 11:55
  • I added `import sys; reload(sys); sys.setdefaultencoding('UTF8')` to my code and encoded it the way you said. – mk_sch Jun 19 '17 at 11:57
  • Good, but did you use re.sub(...) **.encode("utf8")**? Note that you may just add `#-*- coding: utf-8 -*-` at the top of the file to have it treated as a UTF-8 file. `sys.setdefaultencoding('UTF8')` is [considered a bad practice](https://stackoverflow.com/q/3828723/3832970). – Wiktor Stribiżew Jun 19 '17 at 12:01
  • yeah, I copied your code and executed it, but the result is same. Your code in the link works very well, but when I copy it to my notebook, it outputs a different result. I run on Python 2.7, maybe that is the reason. – mk_sch Jun 19 '17 at 12:04
  • I also posted a link to Python 2.7. Please post the code you are trying. Edit the question. – Wiktor Stribiżew Jun 19 '17 at 12:11
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/147066/discussion-between-mk-sch-and-wiktor-stribizew). – mk_sch Jun 19 '17 at 12:14
  • @mk_sch Ok, you did not use `print()`. You should have. But anyway, I see you just want to split the string. Try http://rextester.com/XPVHR17179 – Wiktor Stribiżew Jun 19 '17 at 12:20
  • Another demo with regex - http://rextester.com/QSWL90452 – Wiktor Stribiżew Jun 19 '17 at 12:26
  • @ Wiktor: thanx for your answer, both of them work with print, but when it comes to returning in a function, the problem is still there. – mk_sch Jun 20 '17 at 07:44

1 Answer


It should work like this:

File: /tmp/foo.py

#-*- coding: utf-8 -*-
import re
text = u'Obras de revisão e recuperação (45453000-7)'
print re.sub(r'\([0-9-]+\)', u'', text)

Note that there is no .encode('utf-8') call.

Now, in a Python console:

>>> import re
>>> text = u'Obras de revisão e recuperação (45453000-7)'
>>> re.sub(r'\([0-9-]+\)', u'', text)
u'Obras de revis\xe3o e recupera\xe7\xe3o '
>>> print re.sub(r'\([0-9-]+\)', u'', text)
Obras de revisão e recuperação

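As you can see, print re.sub(..) writes the readable text, whereas evaluating the expression on its own makes the console echo unicode.__repr__(), which escapes non-ASCII characters. The underlying string is the same in both cases.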

I suspect that is what you are struggling with.
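The same point can be checked for a value returned from a function. The sketch below is an illustration rather than part of the original answer: `strip_code` is a hypothetical helper name, and the `__future__` imports are only there so that the snippet runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re


def strip_code(text):
    """Hypothetical helper: drop a '(digits-and-dashes)' group and the space before it."""
    return re.sub(r'\s*\([0-9-]+\)', '', text)


name = strip_code('Obras de revisão e recuperação (45453000-7)')

# The returned value holds the real characters; only its repr() is escaped.
assert name == 'Obras de revisão e recuperação'

print(repr(name))  # on Python 2 this is the escaped form the console echoes
print(name)        # the readable text, given a terminal that can display it
```

Returning the value does not change it. The escapes appear only when the value is displayed through `repr()`, which is what a console or notebook does for a bare expression.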

For reference: Difference between __str__ and __repr__ in Python
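To get the name and the code as two separate values, as the question asks, one possible approach is a single regex with two groups. This is a minimal sketch, assuming every input has the form "name (code)" with a code made of digits and dashes; `split_name_and_code` and `PATTERN` are hypothetical names, and the `__future__` imports only keep the snippet runnable on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# Name first, then optional whitespace, then a parenthesised code of digits and dashes.
PATTERN = re.compile(r'^(?P<name>.*?)\s*\((?P<code>[0-9-]+)\)\s*$')


def split_name_and_code(text):
    """Hypothetical helper: return (name, code), or (text, None) when no code is found."""
    match = PATTERN.match(text)
    if match is None:
        return text.strip(), None
    return match.group('name'), match.group('code')


name, code = split_name_and_code('Obras de revisão e recuperação (45453000-7)')
print(name)
print(code)
```

Both values stay unicode strings. Encode them (for example with `.encode('utf-8')`) only at the point where bytes are actually needed, such as when writing to a file.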

Arount
  • Nice explanation ! – Till Jun 20 '17 at 07:09
  • It works with print, but when I put your code within a function with return, that outputs the same result: u'Obras de revis\xe3o e recupera\xe7\xe3o ' – mk_sch Jun 20 '17 at 07:36
  • So `unicode.__str__()` returns the readable content of the instance. When you want to display or store content as a human-readable string, this is what you will have to call. When you use the `print` statement, it implicitly calls `__str__()` on the printed instance; you can also call `__str__` directly with `str(foo)`. In a Python console, if you don't use print but just type `>>> foo`, it calls `__repr__()` instead of `__str__`, even if the value comes from a function (`def foo(): return u'lol'; print foo()` does exactly the same as `print u'lol'`). – Arount Jun 20 '17 at 07:53