How do I ping a list of URLs (around 80k) using Python? Each URL is given in the format "https://www.test.com/en/Doors-Down/Buffalo/pm/99000002/3991","99000002". So I need to remove the number after the comma (,"99000002") and request the remaining URL to find which ones return a 404 error code. I was able to remove the last part using the rsplit method:
df= '"https://www.test.com/en/Doors-Down/Buffalo/pm/99000002/3991","99000002"'
print(df.rsplit(',',1)[0])
I have the URLs in a CSV file, but how do I ping such a huge list of URLs?
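To show what I mean, here is a minimal sketch of the cleanup step built on the rsplit call above (clean_url is just a name I picked; the sample line is the one from the question):

```python
def clean_url(field):
    # Drop everything after the last comma, then strip the surrounding
    # quotes, leaving only the URL itself.
    return field.rsplit(',', 1)[0].strip('"')

line = '"https://www.test.com/en/Doors-Down/Buffalo/pm/99000002/3991","99000002"'
print(clean_url(line))
# https://www.test.com/en/Doors-Down/Buffalo/pm/99000002/3991
```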
Update

I did try a solution, but after a while I get an error. My code:
import csv
import urllib2

with open(r'C:\Users\kanchan.jha\Desktop\pm\performer_metros.csv', "rU") as csvfile:
    reader = csv.reader(csvfile)
    output = csv.writer(open(r'C:\Users\kanchan.jha\Desktop\pm\pm_quotes.csv', 'w'))
    for row in reader:
        splitlist = [i.split(',', 1)[0] for i in row]
        #output.writerow(splitlist)
        # converting to string and removing the extra quotes and square brackets
        url = str(splitlist)[1:-1]
        urls = str(url.strip('\''))
        content = urllib2.urlopen(urls).read()
        if content.find('404') > -1:
            output.writerow(splitlist)
The code runs for a while and then I get the error pasted below. An output file is created, but it contains only 10-15 URLs with a 404 error. It seems only a few of the URLs are checked, not all of them.
Traceback (most recent call last):
File "c:\Users\kanchan.jha\Desktop\file.py", line 27, in <module>
content = urllib2.urlopen(urls, timeout =1000).read()
File "C:\Python27\lib\urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 435, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 548, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 473, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 407, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 556, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
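From the traceback, the crash seems to be that urlopen raises HTTPError for any 4xx/5xx response instead of returning a page body, so content.find('404') never runs for the URLs I actually care about. A minimal sketch of catching the error instead, with a thread pool since checking ~80k URLs one at a time would take very long (Python 3 names shown; in Python 2 the same classes live in the urllib2 module, and status_of/find_404s are names I made up):

```python
import urllib.request
import urllib.error
from concurrent.futures import ThreadPoolExecutor

def status_of(url, timeout=10):
    """Return the HTTP status code for url, or None if unreachable."""
    try:
        return urllib.request.urlopen(url, timeout=timeout).getcode()
    except urllib.error.HTTPError as e:
        # urlopen raises for 4xx/5xx rather than returning a body,
        # which is exactly the crash shown in the traceback above
        return e.code
    except urllib.error.URLError:
        return None  # DNS failure, refused connection, timeout

def find_404s(urls, workers=32):
    """Check URLs concurrently and return the ones answering 404."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = pool.map(status_of, urls)
    return [u for u, code in zip(urls, codes) if code == 404]
```

The 404 URLs could then be written to the output CSV in one pass at the end instead of row by row.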