In my Scrapy results, one unwanted non-ASCII character, \u2013 (also known as character 150, or the en dash), appears in the title, for example u'Director/Senior Director \u2013 Pathology'. I am trying to use an item pipeline to replace \u2013 with a regular comma (,), but the following code doesn't work. No error message is reported either.

from datetime import datetime
from hashlib import md5
from scrapy.exceptions import DropItem
from twisted.enterprise import adbapi
import re
import string

class ReplaceASC2InTitlePipeline(object):
    """replace unwanted ASCII characters in titles"""

    ascii_to_filter = ["\u2013",]

    def process_item(self, item, spider):
        for word in self.ascii_to_filter:
            desc = item.get('title')

            if (desc) and word in desc:
                spider.log("\u2013 in '%s' was replace" % (item['title']) )

                item['title']=item['title'].replace("\u2013", ",")
                return item
        else:
            return item
LearnAWK
  • I'm confused about the else part. If it's a for...else clause, there would generally be a break in the for block. Or is it indented wrong? – sudo bangbang Oct 18 '15 at 11:22
  • The code is modified from some code I found on GitHub that was used to discard unwanted items, but I don't have a lot of experience with Python. – LearnAWK Oct 18 '15 at 19:57

2 Answers

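In Python 2, "\u2013" in a plain string literal is not an escape sequence: it is just the six characters \, u, 2, 0, 1 and 3, so it never matches the en dash in the unicode title. It should be a unicode literal (the same applies to the "\u2013" passed to replace()), so just replace: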

ascii_to_filter = ["\u2013",]

with:

ascii_to_filter = [u"\u2013",]
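
Putting that together, a minimal sketch of the whole pipeline might look like the following. The class name ReplaceEnDashInTitlePipeline and the replacements mapping are hypothetical names chosen for illustration, and the standard logging module is used instead of spider.log only so that the snippet runs without Scrapy installed. The for...else is dropped here, since the item is returned in every case anyway.

```python
import logging

logger = logging.getLogger(__name__)


class ReplaceEnDashInTitlePipeline(object):
    """Replace unwanted non-ASCII characters in titles (illustrative sketch)."""

    # Hypothetical mapping: unwanted character -> replacement.
    # The u prefix is what makes \u2013 an en dash on Python 2.
    replacements = {u"\u2013": u","}

    def process_item(self, item, spider):
        title = item.get('title')
        if title:
            cleaned = title
            for old, new in self.replacements.items():
                cleaned = cleaned.replace(old, new)
            if cleaned != title:
                logger.info("Replaced characters in title; it is now %r", cleaned)
                item['title'] = cleaned
        # Always hand the item on to the next pipeline stage.
        return item
```

With the title from the question, u'Director/Senior Director \u2013 Pathology', this should produce u'Director/Senior Director , Pathology'. Items with no title, or with a title that has nothing to replace, pass through unchanged.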
eLRuLL

After reading this Stack Overflow post, Replace non-ASCII characters..., I came up with this code, which filters out all non-ASCII characters in the titles. In my situation the non-ASCII characters are not needed, so it works perfectly for me.

from datetime import datetime
from hashlib import md5
from scrapy.exceptions import DropItem
from twisted.enterprise import adbapi
import re
import string

class ReplaceASC2InTitlePipeline(object):
    """replace unwanted non-ASCII characters in titles"""

    def process_item(self, item, spider):

        def remove_non_ascii(text):
            # keep only characters with a code point below 128 (plain ASCII)
            return ''.join(i for i in text if ord(i)<128)

        orig_titl = item.get('title')
        if orig_titl:  # skip items whose title is missing or empty
            item['title'] = remove_non_ascii(orig_titl)

            if item['title'] != orig_titl:
                spider.log("Non-ASCII character(s) were removed; title is now '%s'" % (item['title']) )

        return item
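
One side effect worth knowing about: deleting the en dash leaves the spaces on either side of it behind. The standalone sketch below applies the same filter to the title from the question; collapse_whitespace is a hypothetical extra helper, not part of the pipeline above, shown only as one possible way to tidy the result.

```python
def remove_non_ascii(text):
    """Drop every character whose code point is 128 or above."""
    return u''.join(ch for ch in text if ord(ch) < 128)


def collapse_whitespace(text):
    """Hypothetical helper: squeeze runs of whitespace into single spaces."""
    return u' '.join(text.split())


title = u'Director/Senior Director \u2013 Pathology'
stripped = remove_non_ascii(title)

print(stripped)                       # Director/Senior Director  Pathology (two spaces)
print(collapse_whitespace(stripped))  # Director/Senior Director Pathology
```

Whether the extra space matters depends on how the titles are used downstream; for exact-match comparisons or deduplication it may be worth normalising.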
LearnAWK