
I'm trying to avoid scraping the same information more than once. I run this spider every morning to scrape jobs from a job board, then I copy them into Excel and use Remove Duplicates on the list, keyed on the URL. I would like to do this in Scrapy (I can change the txt file to a csv). I would be happy to implement middleware, too.

This is the pipeline that I am trying to use:

from scrapy.exceptions import DropItem


class CraigslistSamplePipeline(object):

    def process_item(self, item, spider):
        with open('URLlog.txt', 'r') as f:                # open my txt file with urls from previous scrapes
            urlx = [url.strip() for url in f.readlines()] # extract each url
            if urlx == item["website_url"]:               # compare old url to URL being scraped
                raise DropItem('Item already in db')      # skip record if in url list
        return item

I'm sure this code is wrong. Can someone please suggest how I can do this? I'm very new to this, so explaining each line would help me a lot. I hope my question makes sense and someone can help me.

I've looked at these posts for help, but was not able to solve my problem:

How to Filter from CSV file using Python Script

Scrapy - Spider crawls duplicate urls

how to filter duplicate requests based on url in scrapy


1 Answer


Use the in keyword. Like so:

 if item['website_url'] in urlx:
     raise DropItem('Item already in db')

You loaded urlx from a file where each line is a URL, so it is now a list of strings. The in keyword checks whether the website URL is in the list urlx; if it is, the expression evaluates to True. Keep in mind the comparison is case-sensitive in my example. You may want to call .lower() on the website URL and on the URLs loaded from the file, as illustrated below.
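For example, a case-insensitive version of the check might look like this (a sketch only, assuming the same URLlog.txt layout and website_url field as above):

 urlx = [url.strip().lower() for url in f.readlines()]  # lowercase the stored URLs
 if item['website_url'].lower() in urlx:                # lowercase the scraped URL too
     raise DropItem('Item already in db')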

There are more efficient ways of doing this, but I assume you just want something that works.
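For instance, one more efficient approach is to read the file only once when the spider opens, keep the URLs in a set for fast lookups, and append newly seen URLs as items pass through, instead of re-reading the file for every item. A rough sketch, assuming the same URLlog.txt file and website_url field as in your pipeline:

 from scrapy.exceptions import DropItem


 class CraigslistSamplePipeline(object):

     def open_spider(self, spider):
         # Read the log of previously scraped URLs once, into a set for fast lookups
         try:
             with open('URLlog.txt', 'r') as f:
                 self.seen_urls = set(url.strip() for url in f)
         except IOError:
             self.seen_urls = set()          # first run: no log file yet
         # Keep the file open for appending new URLs as we go
         self.logfile = open('URLlog.txt', 'a')

     def process_item(self, item, spider):
         url = item['website_url']
         if url in self.seen_urls:
             raise DropItem('Item already in db')
         self.seen_urls.add(url)             # remember it for the rest of this run
         self.logfile.write(url + '\n')      # and for future runs
         return item

     def close_spider(self, spider):
         self.logfile.close()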

Eric Urban
  • thanks Eric, I'm so excited to get this to work. If you know of more efficient ways to do this I would love to hear them; this whole Scrapy project has been me copying and pasting snippets of code from the internet, so my knowledge of best practice and the most efficient ways to do things is very limited – user2636623 Aug 01 '13 at 04:08