Python scraping on page still contains chars like \r \n \t

Question

I am trying to scrape the page http://www.dmoz.org/Computers/Programming/Languages/Python/Books using Scrapy 0.20.2.

I can do everything I need, like extracting the information, sorting it ...

However, I still get \r, \t and \n in the results. For instance, this is one JSON item: {"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]},

The data is correct, but I don't want to see the \t, \r and \n in the result.

My spider is:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from dirbot.items import DmozItem

class DmozSpider(BaseSpider):
   name = "dmoz"
   allowed_domains = ["dmoz.org"]
   start_urls = [
       "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
   ]

   def parse(self, response):
       sel = Selector(response)
       sites = sel.xpath('//ul[@class="directory-url"]/li')
       items = []
       for site in sites:
           item = DmozItem()
           item['title'] = site.xpath('a/text()').extract()
           item['link'] = site.xpath('a/@href').extract()
           item['desc'] = site.xpath('text()').extract()
           items.append(item)
       return items

\r and \n are end-of-line (EOL) characters and \t is a tab character. The most common way of removing them is to use rstrip() — e h, Jan 13 '14 at 12:44
@emh kindly provide an example. Should I use that in my item class? — Marco Dinatsoli, Jan 13 '14 at 12:53
@emh when I tried `site.xpath('a/text()').extract().rstrip()` I got an empty result — Marco Dinatsoli, Jan 13 '14 at 13:00
You could use something like `item['desc'] = map(unicode.strip, site.xpath('a/text()').extract())` — paul trmbrth, Jan 13 '14 at 13:08
Yes, as paul states, there are several ways to do this. With rstrip you can tell Python exactly what you want to strip: something like .rstrip('\r\n\t') will strip trailing EOLs and tabs (with no argument, it strips all trailing whitespace). This might help: http://stackoverflow.com/questions/10711116/strip-spaces-tabs-newlines-python — e h, Jan 13 '14 at 13:36
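
A note on the comments above: `extract()` returns a list of strings, and a list has no `rstrip()` method, so the call has to be applied to each element rather than to the list. The sketch below uses Python 3 and a hard-coded list (`extracted` is a hypothetical stand-in for what `site.xpath('text()').extract()` returns; it is not real Scrapy output):

```python
# Hypothetical stand-in for the list returned by site.xpath('text()').extract()
extracted = ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal.\r\n \r\n "]

# A list has no rstrip(), so calling it on the list fails with AttributeError.
try:
    extracted.rstrip()
    list_has_rstrip = True
except AttributeError:
    list_has_rstrip = False

# strip() with no argument removes spaces, \r, \n and \t from both ends.
stripped = [s.strip() for s in extracted]

# rstrip() only touches the right-hand end, so leading whitespace survives.
right_only = [s.rstrip() for s in extracted]

print(list_has_rstrip)
print(stripped)
print(right_only)
```

Since the unwanted characters appear on both sides of the description, `strip()` is probably a better fit here than `rstrip()`.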

score 3 · Answer 1 · answered Mar 30 '14 at 23:40

I used:

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//ul/li')
    items = []
    for site in sites:
        item = DmozItem()
        # unicode.strip removes leading/trailing whitespace (\r, \n, \t, spaces)
        # from each extracted string (Python 2).
        item['title'] = map(unicode.strip, site.xpath('a/text()').extract())
        item['link'] = map(unicode.strip, site.xpath('a/@href').extract())
        item['desc'] = map(unicode.strip, site.xpath('text()').extract())
        items.append(item)
    return items

and it works. I am not sure exactly how it works yet; I am still reading up on unicode.strip. I hope this helps.
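
For context on why this works: in Python 2, `unicode.strip` is the `strip` method of the `unicode` type, and `map` calls it on every string in the extracted list, removing leading and trailing whitespace from each one. A rough Python 3 equivalent is sketched below, where `str` replaces `unicode` and `map` returns an iterator rather than a list. The helper `strip_all` is a hypothetical name, and dropping the empty strings is an extra design choice that is not part of the answer above.

```python
def strip_all(values):
    """Strip leading/trailing whitespace from each string and drop empty results."""
    stripped = (v.strip() for v in values)
    return [v for v in stripped if v]


# Sample data shaped like the 'desc' field from the question (shortened).
desc = ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book.\r\n \r\n "]

# Direct Python 3 counterpart of map(unicode.strip, ...); list() forces evaluation.
direct = list(map(str.strip, desc))

# Variant that also discards strings that were only whitespace.
without_empties = strip_all(desc)

print(direct)
print(without_empties)
```

In a spider this would be used as, for example, `item['desc'] = strip_all(site.xpath('text()').extract())`, assuming the same item fields as in the question.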

score 0 · Answer 2 · answered Jan 13 '14 at 17:44

Here is another way to do this (I used your JSON data):

>>> data = {"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]}

>>> clean_data = ''.join(data['desc'])

>>> print clean_data.strip(' \r\n\t')

Output:

- The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.
A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.

Instead of:

['\r\n\t\t\t\r\n ', ' \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n ']

score 0 · Answer 3 · edited May 23 '17 at 12:02

Assuming you want all \r, \n, and \t removed (not just the stuff on the edges), while still keeping the form of the JSON, you could try the following:

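def normalize_whitespace(json):
    # Strings: collapse every run of whitespace (\r, \n, \t, spaces) into one space.
    if isinstance(json, str):  # use basestring in Python 2 to cover unicode too
        return ' '.join(json.split())

    if isinstance(json, dict):
        it = json.items()  # iteritems in Python 2
    elif isinstance(json, list):
        it = enumerate(json)
    else:
        # Numbers, booleans, None: nothing to normalize.
        return json

    # Update the container in place, recursing into nested values.
    for k, v in it:
        json[k] = normalize_whitespace(v)

    return json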

Usage:

>>> normalize_whitespace({"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]})
{'title': ['Data Structures and Algorithms with Object-Oriented Design Patterns in Python'], 'link': ['http://www.brpreiss.com/books/opus7/html/book.html'], 'desc': ['', '- The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns. A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.']}

As pointed out in https://stackoverflow.com/a/10711166/138772, the split-join method is probably better for this than regular expression replacement, because it combines the strip functionality with the whitespace normalization.
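
To illustrate the difference, here is a small comparison using a shortened sample string (`text` is made up for this example). The regular expression collapses whitespace runs but leaves a single space at each edge, so it needs an extra `strip()`, while split-join handles both in one step.

```python
import re

# Shortened sample in the style of the 'desc' strings from the question.
text = " \r\n\t\t\t\r\n - The primary goal.\r\nA secondary goal.\r\n \r\n "

# Regular expression: each whitespace run becomes one space, including at the edges.
with_regex = re.sub(r"\s+", " ", text)

# split() with no argument splits on any whitespace run and discards the
# empty pieces at the edges, so joining gives a stripped, normalized string.
with_split_join = " ".join(text.split())

print(repr(with_regex))
print(repr(with_split_join))
```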
