I am trying to scrapy on http://www.dmoz.org/Computers/Programming/Languages/Python/Books this page using scrapy 0.20.2.
I can do all of what i need like getting information and sort ...
However, I still get the \r and \t and \n in the results. for instance this is one json {"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]},
The data is correct, but i don't want to see the \t and \r and \n in the result.
my spider is
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from dirbot.items import DmozItem
class DmozSpider(BaseSpider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
]
def parse(self, response):
sel = Selector(response)
sites = sel.xpath('//ul[@class="directory-url"]/li')
items = []
for site in sites:
item = DmozItem()
item['title'] = site.xpath('a/text()').extract()
item['link'] = site.xpath('a/@href').extract()
item['desc'] = site.xpath('text()').extract()
items.append(item)
return items