
I have results stored in a .json file, one item per line, in this format:

{"category": ["ctg1"], "pages": 3, "websites": ["x1.com","x2.com","x5.com"]}
{"category": ["ctg2"], "pages": 2, "websites": ["x1.com", "d4.com"]}
...

I have tried to remove just the duplicate values without dropping the whole item, but without success.

The code:

import scrapy
import json
import codecs
from scrapy.exceptions import DropItem

class ResultPipeline(object):

    def __init__(self):
        self.ids_seen = set()
        self.file = codecs.open('results.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # DropItem discards the whole item as soon as any one of its
        # websites has been seen before; this is the unwanted behaviour.
        for sites in item['websites']:
            if sites in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % sites)
            else:
                self.ids_seen.add(sites)
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    # Scrapy item pipelines call close_spider(), not spider_closed()
    def close_spider(self, spider):
        self.file.close()
  • You cannot delete it in the `for sites in item` loop. You could create a list of duplicates and delete them outside that loop. Alternatively, you could make your `websites` container a `set` instead of a `list`. You can use an `OrderedDict` as shown here (see the sketch after these comments): http://stackoverflow.com/questions/12878833/python-unique-list-using-set – Thane Plummer Aug 29 '15 at 20:08
  • Still nothing. I have tried almost all the links. I believe it's not possible to achieve it that way; maybe I have to try something different. Your answer was useful though, thanks. – Prometheus Aug 30 '15 at 22:00
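
As a minimal sketch of the comment's suggestion (the sample `websites` list and variable names below are invented for illustration, not taken from the question):

from collections import OrderedDict

websites = ["x1.com", "x2.com", "x1.com", "x5.com"]

# OrderedDict.fromkeys keeps only the first occurrence of each key
# and preserves insertion order, unlike a plain set
unique_websites = list(OrderedDict.fromkeys(websites))
print(unique_websites)  # ['x1.com', 'x2.com', 'x5.com']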

1 Answer


Instead of dropping the whole item, just rebuild its list of websites from the sites that aren't already in the `ids_seen` set. The sample code below should work, although it's not in your class structure.

import json


line1 = '{"category": ["ctg1"], "pages": 3, "websites": ["x1.com","x2.com","x5.com"]}'
line2 = '{"category": ["ctg2"], "pages": 2, "websites": ["x1.com", "d4.com"]}'

lines = (line1, line2)

ids_seen = set()

def process_item(item):
    # Collect only the sites that have not been seen in any earlier item
    item_unique_sites = []
    for site in item['websites']:
        if site not in ids_seen:
            ids_seen.add(site)
            item_unique_sites.append(site)
    # Replace the original list, dropping the duplicates but keeping the item
    item['websites'] = item_unique_sites
    line = json.dumps(dict(item), ensure_ascii=False) + "\n"
    print(line)
    #self.file.write(line)
    return item


for line in lines:
    json_data = json.loads(line)
    process_item(json_data)
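
For completeness, here is a minimal sketch of how the same rebuild could slot back into the question's ResultPipeline class. This is just the loop above transplanted, assuming the question's file handling; it has not been run against a live spider.

import json
import codecs


class ResultPipeline(object):

    def __init__(self):
        self.ids_seen = set()
        self.file = codecs.open('results.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Rebuild the websites list instead of raising DropItem,
        # so the item survives with its duplicates removed
        unique_sites = []
        for site in item['websites']:
            if site not in self.ids_seen:
                self.ids_seen.add(site)
                unique_sites.append(site)
        item['websites'] = unique_sites
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()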