I'm writing a small crawler using Scrapy. One of the XPath expressions selects a price followed by "zł" (the Polish currency symbol). The problem is that the value is padded with newline characters, spaces and non-breaking spaces, so when I do:

sel.xpath("div/div/span/span/text()[normalize-space(.)]").extract()

I get:

[u'\n            1\xa0740,00 z\u0142\n            \n            \n                ']

Which I want to change to

[u'1740,00']

or simply convert to a float. What is the best/simplest/fastest way to do this?

Lord_JABA

2 Answers

You can use re.findall to extract only the digits and the comma from the string:

>>> import re
>>> s = u'\n            1\xa0740,00 z\u0142\n            \n            \n            '
>>> L = re.findall(r'[\d,]', s)
>>> "".join(L)
'1740,00'
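Since the question also asks for a float, here is a minimal sketch that builds on the same regular-expression idea. The helper name `parse_price` is hypothetical, and the sketch assumes the comma is always the decimal separator and that the price contains only ASCII digits:

```python
import re

def parse_price(raw):
    """Return the price in `raw` as a float (hypothetical helper).

    Assumes a Polish-style number: comma as the decimal separator,
    with spaces or non-breaking spaces as thousands separators.
    """
    # Keep only ASCII digits and the comma; everything else is dropped.
    cleaned = "".join(re.findall(r"[0-9,]", raw))
    # float() expects a dot as the decimal separator.
    return float(cleaned.replace(",", "."))

s = u'\n            1\xa0740,00 z\u0142\n            \n            \n            '
price = parse_price(s)
```

If exact decimal arithmetic matters for prices, `decimal.Decimal` could be used in place of `float` on the same cleaned string.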
Eugene Yarmash
  • I got it going like this: `raw_price = sel.xpath("div/div/span/span/text()").extract()` and then `item['cena'] = raw_price[0].strip()` – Lord_JABA Dec 20 '15 at 18:57
  • @Lord_JABA: `.strip()` only removes leading and trailing whitespace. Perhaps, the most flexible solution here is using a regular expression. – Eugene Yarmash Dec 20 '15 at 19:12
  • But these characters don't show up in the final saved output if we use `.strip()` – Nikhil Parmar Dec 20 '15 at 19:15
  • @Nikhil: `s.strip()` will give you `'1\xa0740,00 zł'` which is not what the OP wanted. – Eugene Yarmash Dec 20 '15 at 19:19
  • @eugeney I encounter these characters a lot while scraping, but when I insert into my DB, say `mongo`, or even write to a CSV, I get the actual data. Maybe I am wrong. – Nikhil Parmar Dec 20 '15 at 19:20

If you are interested only in ascii digits then the fastest method is to use bytes.translate():

import string

keep = string.digits.encode() + b',' # characters to keep
delete = bytearray(set(range(0x100)) - set(bytearray(keep))) # to delete
result = unicode_string.encode('ascii', 'ignore').translate(None, delete).decode()
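To make the two steps visible, here is a sketch of the same approach run on the sample string from the question under Python 3. The intermediate name `ascii_bytes` is introduced only for illustration:

```python
import string

# Sample input taken from the question.
unicode_string = u'\n            1\xa0740,00 z\u0142\n            \n            \n                '

keep = string.digits.encode() + b','            # bytes to keep
delete = bytes(set(range(0x100)) - set(keep))   # every other byte value

# Step 1: encoding with 'ignore' silently drops non-ASCII characters,
# here the non-breaking space (\xa0) and the letter "ł" (\u0142).
ascii_bytes = unicode_string.encode('ascii', 'ignore')

# Step 2: delete the remaining unwanted ASCII bytes (newlines, spaces, "z").
result = ascii_bytes.translate(None, delete).decode()
```

Note that step 1 would also discard any non-ASCII digits, which is why this method only fits the ASCII-digits case.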

You could write it more succinctly using Unicode .translate():

import string
import sys

keep = set(map(ord, string.digits + ',')) # characters to keep
table = dict.fromkeys(i for i in range(sys.maxunicode + 1) if i not in keep)
result = unicode_string.translate(table)

The result is the same, but before Python 3.5 this version is always very slow (Python 3.5 improves the situation for the ASCII-only case).
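As a variation that is not part of the original answer, the full table over every code point can be avoided with a mapping that answers lookups lazily. The class name `KeepOnly` is hypothetical. The sketch relies on `str.translate()` treating a `None` lookup result as "delete this character", and on `dict` subclasses calling `__missing__` for absent keys:

```python
import string

class KeepOnly(dict):
    """Translation table that deletes every code point not stored in it."""

    def __missing__(self, key):
        # None tells str.translate() to drop the character.
        return None

# Map each kept character to itself; everything else falls to __missing__.
table = KeepOnly((ord(c), ord(c)) for c in string.digits + ',')

unicode_string = u'\n            1\xa0740,00 z\u0142\n            \n            \n                '
result = unicode_string.translate(table)
```

This keeps the table at eleven entries instead of one per Unicode code point. Whether it is faster in practice would need to be measured on the Python version in use.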

jfs