I need to scrape some text from within a script on a page and save it in a Scrapy item, presumably as a UTF-8 string. However, the literal text I'm scraping has special characters written out as what I believe to be hex escape sequences, e.g. "-" is written as "\x2D" and a space as "\x20". How can I scrape characters represented as "\x2D" but save them as "-" in my Scrapy item?
Excerpt of contents on scraped page:
<script type="text/javascript">
[approx 100 various lines of script, omitted]
"author": "Kurt\x20Vonnegut",
"internetPrice": "799",
"inventoryType": "new",
"title": "Slaughterhouse\x2DFive",
"publishedYear": "1999",
[approx 50 additional various lines of script, removed]
</script>
My scrapy script goes like this:
pattern_title = r'"title": "(.+)"'
title_raw = response.xpath('//script[@type="text/javascript"]').re(pattern_title)
item['title'] = title_raw[0]
For this item, Scrapy's output is:
'author': u'Kurt\x20Vonnegut', 'title': u'Slaughterhouse\x2DFive'
Ideally, I would like:
'author': 'Kurt Vonnegut', 'title': 'Slaughterhouse-Five'
Things I've tried with no change to the output:
- Change last line to: item['title'] = title_raw[0].decode('utf-8')
- Change last line to: item['title'] = title_raw[0].encode('latin1').decode('utf-8')
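A quick check of why the encode/decode attempt probably changes nothing. This is only a sketch, assuming Python 3 and assuming the scraped value holds a literal backslash followed by x2D (four separate ASCII characters) rather than a mis-encoded byte. The name raw_title is a hypothetical stand-in for title_raw[0].

```python
# Hypothetical stand-in for title_raw[0]: the raw string holds a literal
# backslash, 'x', '2', 'D' rather than a single hyphen character.
raw_title = r'Slaughterhouse\x2DFive'

# Every character here is plain ASCII, so an encode/decode round trip
# returns exactly the same text.
round_trip = raw_title.encode('latin1').decode('utf-8')

has_backslash = '\\' in raw_title
unchanged = (round_trip == raw_title)
```

If both has_backslash and unchanged are true for the real scraped value, the problem is an escape sequence that needs interpreting, not a character encoding that needs converting.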
Finally, in case it needs to be explicitly stated, I do not have control over how this information is being displayed on the site I'm scraping.
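One possible approach, as a minimal sketch rather than a definitive fix: replace each literal \xNN sequence with the character it encodes before storing the value. It assumes Python 3 and that only two-digit hex escapes appear in the scraped text. The helper name decode_hex_escapes and the sample variables are hypothetical; on Python 2, unichr would be needed in place of chr.

```python
import re

# Matches a literal backslash, 'x', and exactly two hex digits, e.g. \x2D.
HEX_ESCAPE = re.compile(r'\\x([0-9A-Fa-f]{2})')


def decode_hex_escapes(text):
    """Replace each literal \\xNN sequence with the character it encodes."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Hypothetical stand-ins for the values the regex extraction returns.
raw_title = r'Slaughterhouse\x2DFive'
raw_author = r'Kurt\x20Vonnegut'

title = decode_hex_escapes(raw_title)
author = decode_hex_escapes(raw_author)
```

In the spider this would look like item['title'] = decode_hex_escapes(title_raw[0]). Text without any \xNN sequences passes through unchanged, so the helper should be safe to apply to every extracted field.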