I need to scrape some text from within a script on a page and save it in a Scrapy item, presumably as a UTF-8 string. However, the literal text I'm scraping has special characters written out as what I believe to be hex escape sequences, e.g. "-" is written as "\x2D". How can I scrape characters represented as "\x2D" but save them as "-" in my Scrapy item?

Excerpt of contents on scraped page:

<script type="text/javascript">

[approx 100 various lines of script, omitted]

"author": "Kurt\x20Vonnegut",
"internetPrice": "799",
"inventoryType": "new",
"title": "Slaughterhouse\x2DFive",
"publishedYear": "1999",

[approx 50 additional various lines of script, removed]

</script>

My scrapy script goes like this:

pattern_title = r'"title": "(.+)"'
title_raw = response.xpath('//script[@type="text/javascript"]').re(pattern_title)
item['title'] = title_raw[0]

For this item, scrapy's output will return:

'author': u'Kurt\x20Vonnegut', 'title': u'Slaughterhouse\x2DFive'

Ideally, I would like:

'author': 'Kurt Vonnegut', 'title': 'Slaughterhouse-Five'

Things I've tried with no change to the output:

  • Change last line to: item['title'] = title_raw[0].decode('utf-8')
  • Change last line to: item['title'] = title_raw[0].encode('latin1').decode('utf-8')

Finally, in case it needs to be explicitly stated, I do not have control over how this information is being displayed on the site I'm scraping.

Chris

2 Answers

Inspired by "Converting \x escaped string to UTF-8", I solved this by using .decode('string-escape'), as follows:

pattern_title = r'"title": "(.+)"'
title_raw = response.xpath('//script[@type="text/javascript"]').re(pattern_title)
title_raw[0] = title_raw[0].decode('string-escape')
item['title'] = title_raw[0]
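
A note for Python 3 readers: the 'string-escape' codec and str.decode() exist only in Python 2, so the lines above will not run unchanged on Python 3. Below is a minimal sketch of one possible equivalent, using a regular expression instead of a codec. The helper name decode_hex_escapes and the sample raw strings are hypothetical, introduced only for illustration; they stand in for the text that .re() returns. The sketch handles only two-digit \xNN escapes and would mis-handle an escaped backslash such as \\x41.

```python
import re

# Matches a JavaScript-style hex escape such as \x2D
# (a literal backslash, the letter x, then two hex digits).
HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_hex_escapes(text):
    """Replace each literal \\xNN sequence with the character it encodes."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Raw strings stand in for what the regex pulls out of the <script> tag:
# the backslash is a real character here, not a Python escape.
title_raw = r"Slaughterhouse\x2DFive"
author_raw = r"Kurt\x20Vonnegut"

title = decode_hex_escapes(title_raw)
author = decode_hex_escapes(author_raw)

print(title)   # Slaughterhouse-Five
print(author)  # Kurt Vonnegut
```

In the spider this would presumably become item['title'] = decode_hex_escapes(title_raw[0]).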
Chris

You can use urllib's unquote function.

On Python 3.x:

from urllib.parse import unquote
unquote("Kurt\x20Vonnegut")

On Python 2.7:

from urllib import unquote
unquote("Kurt\x20Vonnegut")

Take a look at Item Loaders and Input Processors so you can do this for all scraped fields.

  • Interesting! Thanks for the suggestion. While I can confirm that `item['title'] = unquote("Kurt\x20Vonnegut")` will successfully return 'title': 'Kurt Vonnegut' for all scraped pages, if I do `item['title'] = unquote(title_raw[0])` then I still get 'title': u'Slaughterhouse\x2DFive'. Hmm. I'll (re-)read the resources you've suggested. Thanks again. – Chris Mar 29 '19 at 20:37
  • @Chris Sorry, I only tried to unquote literal strings here... Anyway, there is a method re_first() so you don't have to use re() and get the first match. – Luiz Rodrigues da Silva Mar 29 '19 at 20:58
  • I'll check it out and be sure to accept your answer if I can get it to work. Thank you again for taking the time to answer! :) – Chris Mar 29 '19 at 21:58
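
The difference reported in the comments above can be reproduced without Scrapy. A rough sketch for Python 3 follows; the variable names literal and scraped are hypothetical. In Python source code, "Kurt\x20Vonnegut" is already a string containing a space, because the interpreter resolves \x20 before unquote ever sees it. The scraped text instead contains a real backslash followed by x20, and unquote only decodes percent-encoding (%XX), so it returns that text unchanged.

```python
from urllib.parse import unquote

# In a normal string literal, Python itself turns \x20 into a space.
literal = "Kurt\x20Vonnegut"

# A raw string mimics the scraped text: backslash, 'x', '2', '0' as four characters.
scraped = r"Kurt\x20Vonnegut"

print(literal == "Kurt Vonnegut")    # True: the space was there before unquote ran
print(unquote(scraped) == scraped)   # True: unquote leaves \x20 untouched
print(unquote("Kurt%20Vonnegut"))    # Kurt Vonnegut (percent-encoding is what it decodes)
```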