
I am trying to build a bot that scrapes the purchase history of sold domains. So far I have been able to extract the domains from a CSV file and store them in a list (note: there are 10k domains). The problem occurs when I try to scrape the website with them. I have tried doing this with two domains and it works perfectly. Does anyone know what this error is and how I can fix it? Thank you very much in advance.

My code:

import csv
import queue
import threading
import requests

datafile = open('/Users/.../Documents/Domains.csv', 'r')
myreader = csv.reader(datafile, delimiter=";")
domains = []
for row in myreader:
    domains.append(row[1])
del domains[0]
print("The Domains have been stored into a list")

nmb_sells_record = 0

def result_catcher(domain, queue):
    template_url = "https://namebio.com/{}".format(domain)
    get = requests.get(template_url)
    results = get.text
    if not "No historical sales found." in results:
        last_sold = results[results.index("last sold for ")+15:results.index(" on 2")].replace(",", "")
        last_sold = int(last_sold)
        sold_history = results[results.index("<span class=\"label label-success\">"):results.index(" USD</span> on <span class=\"label")]
    queue.put(results)

#domains = ["chosen.com","koalas.com"]
queues = {}
nmb=0
for x in range(len(domains)):
    new_queue = "queue{}".format(nmb)
    queues[new_queue] = queue.Queue()
    nmb += 1
count = 0
for domain in domains:
    for queue in queues: 
        t = threading.Thread(target=result_catcher, args=(domain,queues[queue]))
        t.start()
print("The Requests were all sent, now they are being analysed")
for queue in queues:
    response_domain = queues[queue].get()
    nmb_sells_record = response_domain.count("for $") + response_domain.count("USD")


print("The Bot has recorded {} domain sells".format(nmb_sells_record))

The output of my code:

Exception in thread Thread-345:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 743, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 850, in _validate_conn
    conn.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connection.py", line 284, in connect
    conn = self._new_conn()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x115a55a20>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known
Nazim Kerimbekov

1 Answer


From the Python docs:

exception `socket.gaierror`: A subclass of `OSError`, this exception is raised for address-related errors by `getaddrinfo()` and `getnameinfo()`.

The accompanying value is a pair (error, string) representing an error returned by a library call. string represents the description of error, as returned by the gai_strerror() C function. The numeric error value will match one of the EAI_* constants defined in this module.

gai => get address info

From the urllib3 wikipage:

New exception: NewConnectionError, raised when we fail to establish a new connection, usually ECONNREFUSED socket error.

Some possible reasons for an ECONNREFUSED error are discussed here, along with some command-line commands you can use to probe the address and port.

By the way, instead of reading all the rows into a list and then deleting the first item, which makes Python slide all the other items over one spot, you can more efficiently skip the header(?) like this:

myreader = csv.reader(datafile, delimiter=";")
next(myreader)  # <==== HERE: skip the header row

domains = []

for row in myreader:
    domains.append(row[1])

next() will throw a StopIteration exception if there isn't a next row. If you want to prevent that, you can call next(myreader, None), which will return None if there is no next row.
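For illustration, here is a minimal self-contained sketch of that behavior, using an in-memory CSV with made-up contents instead of the real Domains.csv:

```python
import csv
import io

# Hypothetical stand-in for the Domains.csv file
data = io.StringIO("id;domain\n1;dfactory.com\n")
myreader = csv.reader(data, delimiter=";")

header = next(myreader)          # consumes the header row
row = next(myreader)             # first data row
sentinel = next(myreader, None)  # no rows left: returns None instead of raising

print(header, row, sentinel)     # → ['id', 'domain'] ['1', 'dfactory.com'] None
```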

Threading example:

import requests
import threading

resources = [
    "dfactory.com",
    "dog.com",
    "cat.com",
]

def result_catcher(resource):
    template_url = "https://namebio.com/{}".format(resource)
    get = requests.get(template_url)


threads = []

for resource in resources:
    t = threading.Thread(target=result_catcher, args=(resource,) )
    t.start()
    threads.append(t)

for thread in threads:
    thread.join()

print("All threads done executing.")

By the way, there is an optimal number of threads to start, and it is far smaller than the number of domains. Create a thread pool, and when one thread is done have it go back and read another resource path from a worker queue. You'll have to run some tests to figure out how many threads are optimal. Creating 10,000 threads is not optimal. If you have four cores, as few as 10 threads may be optimal.
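A minimal sketch of that worker-queue pattern, with a stand-in `fetch` function in place of the real `requests.get` call so it runs without network access:

```python
import queue
import threading

NUM_WORKERS = 4  # tune this; the point is far fewer workers than domains

# Stand-in for requests.get("https://namebio.com/" + domain) so this
# sketch is runnable offline.
def fetch(domain):
    return "fetched {}".format(domain)

def worker(tasks, results):
    # Each worker keeps pulling domains until the queue is drained.
    while True:
        try:
            domain = tasks.get_nowait()
        except queue.Empty:
            return
        results.put((domain, fetch(domain)))
        tasks.task_done()

tasks = queue.Queue()
results = queue.Queue()
for domain in ["dfactory.com", "dog.com", "cat.com"]:
    tasks.put(domain)

threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.qsize())  # → 3
```

With this shape, 10,000 domains still only ever use NUM_WORKERS threads; the queue hands out work as threads free up.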

7stud
  • Oh, I see, I will take a look at the urlib3 wikipage now. Do you have an idea on how to fix this? – Nazim Kerimbekov Jan 13 '18 at 13:51
  • @Fozoro, Can you `print(template_url)` and post the url that causes the error? – 7stud Jan 13 '18 at 14:17
  • good call, the url is https://namebio.com/dfactory.com which makes all this even more strange, as dfactory.com is the first domain in the list – Nazim Kerimbekov Jan 13 '18 at 14:27
  • What happens if you try: `get = requests.get("https://namebio.com/dfactory.com")`? I do not get an error when I do that. – 7stud Jan 13 '18 at 14:34
  • I would assume that this is due to the fact that it will just skip the url that causes the problem (PS: I don’t think that you can actually get an error when using try:) – Nazim Kerimbekov Jan 13 '18 at 14:43
  • @Fozoro, *What happens if you try:* -- That's English--not python code. The python code is the highlighted stuff. I was asking you what result you get if you execute that line of code? – 7stud Jan 13 '18 at 14:46
  • oh my bad, well when I execute it this way it works great. The problem is that I have to do this for 10000 domains, so template_url is inevitable (when I use this code with a list of two domains it works great); the problem appears when I try doing this with a bigger list. – Nazim Kerimbekov Jan 13 '18 at 14:56
  • @Fozorro, How many urls will your code work on before failing? – 7stud Jan 13 '18 at 15:42
  • @Fozorro well based on the fact that dfactory.com is the first URL in the list I would say 0 – Nazim Kerimbekov Jan 13 '18 at 15:44
  • So if you hard code a list with that single url, your code throws an error? But when you execute `get = requests.get("https://namebio.com/dfactory.com")`, there is no error? – 7stud Jan 13 '18 at 15:47
  • The error is thrown when I am trying to thread the operation. If I run it like this `for domain in domains: now = datetime.now() get = requests.get("https://namebio.com/{}".format(domain)) textj = get.text` no error is thrown; the problem with this method is that it takes too much time – Nazim Kerimbekov Jan 13 '18 at 15:51
  • @Fozoro, Okay then, I also get no errors when threading either. Run the example I posted at the end of my answer. I cannot duplicate the error you are getting. – 7stud Jan 13 '18 at 16:04
  • I copy pasted the dictionary that I am using [here](https://justpaste.it/1fnn2). Please try using this as your dictionary, hopefully, you will see the same problem as me. – Nazim Kerimbekov Jan 13 '18 at 16:20
  • I am using an iMac 2017 with the following processor 3.8 GHz Intel Core i5 – Nazim Kerimbekov Jan 13 '18 at 16:26
  • @Fozoro, `Click on the Apple Icon on the far left of the menu bar => About this Mac => System Report button => Hardware => Total Number of Cores` – 7stud Jan 13 '18 at 16:30
  • @Forozo, *I copy pasted the dictionary that I am using here. Please try using this as your dictionary, hopefully, you will see the same problem as me.* It's called a *list* not a dictionary. OSX won't let me start that many threads--my example errors out with a `Too many open files error`. – 7stud Jan 13 '18 at 16:37
  • oh then I am probably having the same problem is there any way to solve this? – Nazim Kerimbekov Jan 13 '18 at 16:38
  • *oh then I am probably having the same problem is there any way to solve this?* Read the text under my threading example. – 7stud Jan 13 '18 at 16:40
  • @Fozoro, *Run the example I posted at the end of my answer. I cannot duplicate the error you are getting.* – 7stud Jan 13 '18 at 16:44
  • @Fozoro, [An example showing how to use queues to feed tasks to a collection of worker processes and collect the results](https://docs.python.org/2/library/multiprocessing.html#examples). Use 15 worker processes and that should be more than enough. – 7stud Jan 13 '18 at 16:45
  • 10,000 requests? Imagine the impact they have on the server. It's almost a DOS attack! – t.m.adam Jan 13 '18 at 16:50
  • @t.m.adam, Yep. I was trying to feel out if a DoS prevention strategy might be causing the error, but the op said the code errored out on the first url. – 7stud Jan 13 '18 at 17:00
  • @7stud ip ban perhaps? – t.m.adam Jan 13 '18 at 17:04
  • @t.m.adam, No. OSX doesn't allow 10,000 threads to be started, and the op's errors were buried within the `Too many files open error`. There's a lot of error output, and the op randomly chose a portion of it. There's no issue with running 1,000 threads, but since 20 or fewer threads is likely optimal for 4 cores, it doesn't even matter what OSX's limit is. And the op can take advantage of all 4 cores on their computer if they use python's builtin multiprocessing module. Over and out. – 7stud Jan 13 '18 at 17:20
  • @t.m.adam @7stud Oh my god, I just found what was causing the problem. The problem had nothing to do with the number of requests I was sending but with the way I constructed the for loop: `for domain in domains: for queue in queues:` so basically the code was applying 10k queues to each domain (instead of applying one queue per domain), and by consequence this made the computer go insane. All I had to do was replace that piece of code with `for domain,queue in zip(domains,queues):`. I also added `sleep(0.02)` to give more time to the computer. Now it works great. Thank you very much for your help. – Nazim Kerimbekov Jan 13 '18 at 18:24
  • @Forozo, Yes, your original code was starting 10,000 * N threads, but I simplified your code by creating an example that starts only 10,000 threads. Now, you have rewritten your code to start *only* 10,000 threads. I can assure you that your code is still shite. As I tried to tell you, you can use 20 threads to retrieve 10,000 resources and your code will be MORE efficient. *I also added sleep(0.02)* -> Another sign that your code is poorly written. You should also use the multiprocessing module so that you can use 4 cores simultaneously. But I can only lead a horse to water... – 7stud Jan 14 '18 at 06:21
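For reference, the multiprocessing approach suggested in the comments above can be sketched like this (again with a stand-in `fetch` instead of the real request, and an arbitrarily chosen pool size):

```python
import multiprocessing

# Stand-in for the real requests.get call so the sketch runs offline;
# in the real script fetch() would download https://namebio.com/<domain>.
def fetch(domain):
    return "fetched {}".format(domain)

if __name__ == "__main__":
    domains = ["dfactory.com", "dog.com", "cat.com"]
    # 15 worker processes were suggested for the full 10k list;
    # 4 is plenty for this tiny example.
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(fetch, domains)
    print(results)
```

Unlike threads, the pool processes run on separate cores, so this also sidesteps the GIL for any CPU-bound parsing of the responses.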