
How do I ping a list of URLs (around 80k) using Python? Each URL is given in the format "https://www.test.com/en/Doors-Down/Buffalo/pm/99000002/3991","99000002". I need to remove the number after the comma (,"99000002") and request the remaining URL to find which ones return a 404 error code. I was able to remove the trailing part using the rsplit string method.

df= '"https://www.test.com/en/Doors-Down/Buffalo/pm/99000002/3991","99000002"'
print(df.rsplit(',',1)[0])

I have the URLs in a CSV file, but how do I ping such a huge list of URLs?

Update

I did try a solution, but after some time I get an error. My code:

import csv
import urllib2

with open(r'C:\Users\kanchan.jha\Desktop\pm\performer_metros.csv', 'rU') as csvfile:
    reader = csv.reader(csvfile)
    output = csv.writer(open(r'C:\Users\kanchan.jha\Desktop\pm\pm_quotes.csv', 'w'))
    for row in reader:
        splitlist = [i.split(',', 1)[0] for i in row]
        #output.writerow(splitlist)
        # converting to string and removing the extra quotes and square brackets
        url = str(splitlist)[1:-1]
        urls = str(url.strip('\''))
        content = urllib2.urlopen(urls).read()
        if content.find('404') > -1:
            output.writerow(splitlist)

The code runs for a while and then I get an error (pasted below). An output file is created, but it contains only 10-15 URLs having a 404 error. It seems only a few URLs are checked for the error, not all of them.

Traceback (most recent call last):
  File "c:\Users\kanchan.jha\Desktop\file.py", line 27, in <module>
    content = urllib2.urlopen(urls, timeout =1000).read()
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 435, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
  • Possible duplicate of [Ping a site in Python?](https://stackoverflow.com/questions/316866/ping-a-site-in-python) – bfontaine Apr 03 '18 at 15:48
  • Are the URLs sequential? Which part changes? – 001 Apr 03 '18 at 15:48
  • Does it really have the unbalanced quotation marks? – Jared Smith Apr 03 '18 at 15:49
  • I don't think it is a duplicate of [Ping a site in Python?]. The OP is asking how to deal with 80K URLs, not how to ping a single site. – Sphinx Apr 03 '18 at 15:49
  • An idea: before pinging, you may want to check the domain name (host + port) for each URL; many may share a domain, so you can filter out duplicates before pinging. It will save you a lot of time if many of the 80K URLs point at the same domain (see the sketch after these comments). – Sphinx Apr 03 '18 at 15:51
  • If you're going to iterate through a list of 80k URLs from a single domain, you need to implement pauses between each loop to make your program humanlike - failing to do so can get your IP blocked. – Evan Nowak Apr 03 '18 at 15:55
  • @EvanNowak, I don't think pinging the same domain repeatedly is necessary, so a pause is not required. – Sphinx Apr 03 '18 at 15:57
  • Then, for the filtered 80K URLs, you can create many threads to ping the hosts simultaneously. – Sphinx Apr 03 '18 at 15:59
  • Yes. My problem is dealing with such a large list of URLs. I can ping a single URL and find its status code. Any suggestion will be helpful. – Kanchan Jha Apr 03 '18 at 16:04
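
A rough sketch of the deduplication idea from the comments, assuming the CSV layout from the question (the bare file name performer_metros.csv and printing the host count are just for illustration):

import csv
from urlparse import urlparse  # Python 2; use urllib.parse in Python 3

# collect the set of distinct hosts covered by the 80K URLs,
# so duplicates can be filtered out before any requests are made
unique_hosts = set()
with open('performer_metros.csv', 'rU') as csvfile:
    for row in csv.reader(csvfile):
        url = row[0]  # first column holds the URL
        unique_hosts.add(urlparse(url).netloc)

print(len(unique_hosts))  # how many distinct hosts there actually are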

3 Answers


You can use the requests library to ping all the URLs one by one and collect which ones returned a 404. You can keep writing this data to disk instead of holding it all in memory if you want to preserve it.

import requests

# raw_string_urls is your list of 80k urls with string attached
raw_string_urls = ['"https://www.test.com/en/Doors-Down/Buffalo/pm/99000002/3991","99000002"', '"https://www.test.com/en/Doors-Down/Buffalo/pm/99000002/3991","99000002"', '"https://www.test.com/en/Doors-Down/Buffalo/pm/99000002/3991","99000002"', '"https://www.test.com/en/Doors-Down/Buffalo/pm/99000002/3991","99000002"']

not_found_urls = list()

# Iterate here on the raw_string_urls
# The below code could be executed for each url.
for raw_string_url in raw_string_urls:

    url = raw_string_url.split(',')[0].strip('"')

    r = requests.get(url)
    print(url)
    print(r.status_code)
    if r.status_code == 404:
        not_found_urls.append(url)

You can then dump the not_found_urls list to a JSON file or whatever you want.
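
For example, a minimal sketch of that last step (the output file name not_found.json is just an illustration):

import json

# persist the collected 404 URLs so they survive the run
with open('not_found.json', 'w') as f:
    json.dump(not_found_urls, f, indent=2)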

Sanyam Khurana

You can ping a URL using Python Requests.

import requests

url = "https://stackoverflow.com/questions/49634166/how-do-i-have-a-list-of-url-around-80k-using-python"

response = requests.get(url)
print(response.status_code)
# 200

Once you have your URLs, you can easily iterate through the list and send a GET request, saving or printing the result for each URL as per your requirements. Not sure if it's going to work seamlessly with such a big list, though. Also note that this assumes every URL is valid and reachable without authentication, which may not be the case.
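
A minimal sketch of that loop, assuming the URLs have already been cleaned into a plain Python list (the list name urls and the timeout value are illustrative):

import requests

urls = [...]  # cleaned URLs, e.g. read from the CSV

for url in urls:
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 404:
            print(url)  # or write it to a file
    except requests.exceptions.RequestException:
        # connection errors, timeouts, invalid URLs, etc.
        print('failed to reach ' + url)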

Edword

This is a snippet of infrastructure code to ping the URLs using multi-threading.

It is a simple worker-queue model: there is a queue of tasks, and every worker (thread) spawned listens on this queue and takes tasks from it.

By using multiple threads you can process 80K requests in a reasonable time.

import threading, Queue, requests

pool = Queue.Queue()

num_worker_threads = 10

def ping(url):
  # do a ping to the url return True/False or whatever you want..
  response = requests.get(url)
  if response.status_code != 200:
    return False
  return True

def worker():
  while True:
    url = pool.get()
    try:
      response = ping(url)
      # check if the response is OK and do stuff (printing to a log or something)
    except Exception as e:
      pass
    pool.task_done()


for i in range(num_worker_threads):
  t = threading.Thread(target=worker, args=())
  t.setDaemon(True)
  t.start()

urls = [...] #list of urls to check

for url in urls:
  pool.put(url)

pool.join()
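
To tie this back to the question, one way to record which URLs came back as 404 is to push them onto a second queue from inside the worker and drain it after pool.join(). A rough sketch under the same Python 2 imports as above (the not_found queue and the status-code check are additions, not part of the original answer):

not_found = Queue.Queue()  # collects URLs that returned 404

def worker():
  while True:
    url = pool.get()
    try:
      response = requests.get(url)
      if response.status_code == 404:
        not_found.put(url)
    except Exception:
      pass  # connection errors, timeouts, etc.
    pool.task_done()

# start the threads and fill the pool as above; after pool.join():
# print(list(not_found.queue))  # all URLs that returned 404
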
shahaf