Scrapy: Why extracted strings are in this format?

Question

I'm doing

item['desc'] = site.select('a/text()').extract()

but this will be printed like this

[u'\n                    A mano libera\n                  ']

What must I do to trim it and remove the strange characters like [u'\n , the trailing spaces and '] ?

I cannot trim (strip)

exceptions.AttributeError: 'list' object has no attribute 'strip'

and if I convert it to a string and then strip it, the result is still the string above, which I suppose is in UTF-8

score 9 · Answer 1 · answered Jun 10 '13 at 23:50

There's a nice solution to this using Item Loaders. Item Loaders are objects that get data from responses, process the data and build Items for you. Here's an example of an Item Loader that will strip the strings and return the first value that matches the XPath, if any:

from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose, TakeFirst

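class MyItemLoader(XPathItemLoader):
    # MyItem is your own Item subclass, defined elsewhere in the project
    default_item_class = MyItem
    # input: strip whitespace from every extracted string
    default_input_processor = MapCompose(lambda string: string.strip())
    # output: keep only the first non-empty value
    default_output_processor = TakeFirst()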

And you use it like this:

def parse(self, response):
    loader = MyItemLoader(response=response)
    loader.add_xpath('desc', 'a/text()')
    return loader.load_item()
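
To make it clearer what the two processors do to the extracted list, here is a rough plain-Python sketch of the same idea. The names map_compose and take_first are hypothetical stand-ins written for illustration; they are not Scrapy's implementation and only approximate its behaviour:

```python
def map_compose(*functions):
    """Hypothetical stand-in for MapCompose: run each function over every value."""
    def process(values):
        for func in functions:
            values = [func(value) for value in values]
            # values that a function turned into None are dropped
            values = [value for value in values if value is not None]
        return values
    return process


def take_first(values):
    """Hypothetical stand-in for TakeFirst: first value that is not None or empty."""
    for value in values:
        if value is not None and value != u'':
            return value
    return None


strip_all = map_compose(lambda string: string.strip())

# Sample data: a whitespace-only text node followed by the real text
extracted = [u'\n      ', u'\n    A mano libera\n   ']
cleaned = strip_all(extracted)   # every string stripped, list kept
desc = take_first(cleaned)       # the whitespace-only entry is skipped
```

The input step keeps the list and cleans each element; the output step is what turns the list into a single string for the item field.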

score 8 · Accepted Answer · answered Jun 08 '13 at 14:48


The HTML page may very well contain these whitespace characters.

What you retrieve is a list of unicode strings, which is why you can't simply call strip on it. If you want to strip these whitespace characters from each string in this list, you can run the following:

>>> [s.strip() for s in [u'\n                    A mano libera\n                  ']]
[u'A mano libera']

If only the first element matters to you, then simply do:

>>> [u'\n                    A mano libera\n                  '][0].strip()
u'A mano libera'
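
One caveat with indexing [0]: if the XPath matches nothing, extract() returns an empty list and [0] raises IndexError. A minimal sketch of a guard, assuming you want a fallback value in that case (first_stripped is a hypothetical helper name, not part of Scrapy):

```python
def first_stripped(values, default=None):
    """Strip and return the first extracted string, or default if the list is empty."""
    return values[0].strip() if values else default


# Sample data standing in for the result of extract()
extracted = [u'\n                    A mano libera\n                  ']
desc = first_stripped(extracted)
missing = first_stripped([], default=u'')
```

In the spider this would be used as item['desc'] = first_stripped(site.select('a/text()').extract()).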

icecrime

Please, can you edit to show me how to set the result to a variable ? I cannot 'strip' the result of extract() ! – realtebo Jun 08 '13 at 14:50
Wow ! You found the problem:: **It was a list**, so using `item['desc'] = str(site.select('a/text()').extract()[0]).strip();` I've got what I need ! – realtebo Jun 08 '13 at 14:52

score 1 · Answer 3 · answered Jul 18 '16 at 11:32


desc = site.select('a/text()').extract()
desc = [s.strip() for s in desc]
item['desc'] = desc[0]

Nanhe Kumar

While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. (This answer got into the 'Low Quality Posts' queue ;) .. ) – FirstOne Jul 18 '16 at 18:11
@FirstOne: I don't agree with your reply; most users on Stack Overflow only want the exact answer – Parnit Das Jul 20 '16 at 05:58
