
I am having trouble checking for existing data in Scrapy. I use Elasticsearch as my database. Below is the code I am trying to execute:

import json
from datetime import datetime

import scrapy
from elasticsearch import Elasticsearch

es = Elasticsearch()

def checkIfURLExistsInCrawler(self, single_url):
    # Exact-phrase match on the "url" field
    elastic_query = {
        "query": {
            "match_phrase": {
                "url": single_url
            }
        }
    }

    result = es.search(index='test', doc_type='test', body=elastic_query)['hits']['hits']
    return result

def start_requests(self):
    urls = [

        # here I have some URLs; there is a chance
        # that some of them are duplicates, so I have to
        # validate them, but inside the for loop it is not working

    ]

    for request_url in urls:
        checkExists = self.checkIfURLExistsInCrawler(request_url)

        if not checkExists:
            beingCrawledUrl = {
                'url': request_url,
                'added_on': datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            }
            es.index(index='test', doc_type='test', body=beingCrawledUrl)
            yield scrapy.Request(url=request_url, callback=self.parse)

If I execute this code, all the records inside `urls = [ ]` are inserted into the "test" index even when they are duplicates, because the validation I put above is not working.

But if I run this again with the same data, the validation works. So please, can anyone help me out?
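For reference, exact duplicates inside the `urls` list can also be filtered out in memory before any Elasticsearch round-trip, which avoids the problem within a single run regardless of index timing. This is a sketch, not the poster's code; `dedupe` is a hypothetical helper name:

```python
def dedupe(urls):
    """Return urls with exact duplicates removed, preserving order."""
    seen = set()
    unique = []
    for u in urls:
        if u not in seen:
            seen.add(u)
            unique.append(u)
    return unique

print(dedupe(['http://a.com', 'http://b.com', 'http://a.com']))
# ['http://a.com', 'http://b.com']
```

This only catches duplicates inside one run; URLs already indexed in earlier runs still need the Elasticsearch check.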

  • Please clarify first: if you remove all the ElasticSearch related code, does the crawler reach all the expected URLs? You are yielding an empty Request, which does not seem correct. – malberts Feb 14 '19 at 07:43
  • And when you say your `urls` list may contain duplicates, are they *exact* duplicates? Because then you must either just clean the list before you put it in the code, or remove duplicates like here: https://stackoverflow.com/questions/7961363/removing-duplicates-in-lists – malberts Feb 14 '19 at 07:48
  • And can you confirm that your ES queries are correct when you use it outside of Scrapy? – malberts Feb 14 '19 at 07:49
  • @malberts, the code does not have any Scrapy execution problem. The issue I am facing is that when the for loop starts sending URLs from the array, the validation I have put in is not working. E.g. there are 10 URLs in the `urls` array and 3 of them are exact duplicates, so when I run the crawler all 10 URLs are inserted into the DB instead of 7. Now if I run the crawler a second time with the same data, the validation works and no data is inserted into the DB. That is the issue I am facing, and because of it I have lots of duplicated data in my ES DB. – Milan Hirpara Feb 14 '19 at 09:16
  • This sounds like an ES issue, not a Scrapy issue. It might have to do with how you initialise the ES connection which, I assume, is done when the spider starts. There might be some caching here where during a single run it is not getting updated results back. But I haven't used ES in many years, so I won't be able to debug that. My suggestion is to extract the ES code into a standalone Python script to confirm that it does what you expect, i.e. (1) connect to ES; (2) do your `checkIfURLExistsInCrawler` check; (3) do the `InsertData` line; (4) do the `checkIfURLExistsInCrawler` check again. – malberts Feb 14 '19 at 09:22
  • 1
    @malberts Thanks for the comment indeed its ES issue and i have found solution as well. problem was that ES operations(e.g. insert,update,delete) does not reflect immideatly .it take aroung 1 second to reflect any document to be rready in search reasult that's why validation not working . need to refresh index at operation time and issue solve. any how thank for looking into it really appreciate it. – Milan Hirpara Feb 14 '19 at 12:30
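As a sketch of the fix described in the last comment: the Python Elasticsearch client's `index` call accepts a `refresh` parameter, and passing `refresh='wait_for'` makes the new document visible to the very next search, so the duplicate check sees it within the same run. The `index_url` helper and the stand-in client below are illustrative, not from the original post:

```python
from datetime import datetime

def index_url(es, single_url, index='test', doc_type='test'):
    """Index a crawled URL so it is immediately visible to searches.

    refresh='wait_for' waits until the new document is searchable,
    so a duplicate check later in the same run will see it.
    """
    doc = {
        'url': single_url,
        'added_on': datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    }
    return es.index(index=index, doc_type=doc_type, body=doc,
                    refresh='wait_for')

# Minimal stand-in client to show the call shape without a live cluster:
class FakeES:
    def index(self, **kwargs):
        self.last_call = kwargs
        return {'result': 'created'}

es = FakeES()
index_url(es, 'http://example.com')
print(es.last_call['refresh'])  # wait_for
```

Note that `refresh='wait_for'` (or an explicit index refresh) on every insert trades indexing throughput for read-your-own-writes consistency, which is usually acceptable at crawler insert rates.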
