
I've been trying for a long time to write the results to my file, but since it's a multithreaded task, the lines end up interleaved (mixed together) in the output file

The code that writes to the file is in the get_url function

And this function is launched via pool.submit(get_url, line)

import requests
from concurrent.futures import ThreadPoolExecutor
import fileinput
from bs4 import BeautifulSoup
import traceback
import threading


from requests.packages.urllib3.exceptions import InsecureRequestWarning
import warnings

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

count_requests = 0
host_error = 0

def get_url(url):

    try:
        global count_requests
        result_request = requests.get(url, verify=False)
        soup = BeautifulSoup(result_request.text, 'html.parser')

   
        with open('outfile.txt', 'a', encoding="utf-8") as f:
            f.write(soup.title.get_text())
            
        count_requests = count_requests + 1
    except:
        global host_error
        host_error = host_error + 1




with ThreadPoolExecutor(max_workers=100) as pool:
    for line in fileinput.input(['urls.txt']):
        pool.submit(get_url,line)
        print(str("requests success : ") + str(count_requests) + str(" | requests error ") + str(host_error), end='\r')
    


    

This is what the output looks like:

google.com - Google

w3schools.com - W3Schools Online Web Tutorials

    What's `def get_url(ip + url):`? Please post your program output as well. Check [SO: How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) or [SO: How to create a Minimal, Reproducible Example (reprex (mcve))](https://stackoverflow.com/help/minimal-reproducible-example) for more asking related details. Also, [JonSkeet.CodeBlog: WRITING THE PERFECT QUESTION](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question) might be a good point to start. – CristiFati Jul 09 '22 at 23:07
  • 2
    Wrong solution to a problem. If you require the results to be written in a certain order, why multithread then? Or store the results in a temporary file, which another thread sorts before writing it to the final output file? But even then, how would that thread know that no further output will be coming later? Maybe write the output to a queue which is sorted and dumped to the final file after an assumed delay? Simpler to sort at the end, when all your urls are processed. – Nic3500 Jul 09 '22 at 23:09
  • Sorry for the (ip + url), that was a mistake, I edited – yand Jul 09 '22 at 23:09
  • I use multithread because I need to make many HTTP requests, and I would like to save the result (title of the page) in a file as requests are made – yand Jul 09 '22 at 23:12
  • The output simply contains the site URL and the page title – yand Jul 09 '22 at 23:12
  • Do you care which order they are written in, or just that they don't get interleaved? – tdelaney Jul 09 '22 at 23:22
  • That they don't get interleaved – yand Jul 09 '22 at 23:25
  • Pass the line number to each instance of `get_url()` and have it save its result to a different file each time, which is named with the use of the line number. When all threads in the pool are finished, have your main thread go through those files and merge them - i.e. combine their texts to `outfile.txt`. (don't forget to delete them). Sorting also takes care of itself that way. – kyriakosSt Jul 09 '22 at 23:25
  • https://stackoverflow.com/questions/2301458/python-multiple-threads-accessing-same-file – gabriel Jul 09 '22 at 23:39
  • You might consider having a separate file-writing process -- it writes to the file, and the other processes send text to the file-writing process to be written. That way you only have one process writing to the file, so you have better control over how things get written/interleaved. – Jeremy Friesner Jul 10 '22 at 18:18
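Here is a minimal sketch of Jeremy Friesner's single-writer suggestion above, using a queue.Queue and one dedicated writer thread. The fetch code mirrors the question; the writer/sentinel plumbing and the output format are assumptions, not part of the original post:

import queue
import threading
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

results = queue.Queue()

def writer():
    # The only code that touches the output file, so lines can never interleave.
    with open("outfile.txt", "a", encoding="utf-8") as f:
        while True:
            line = results.get()
            if line is None:              # sentinel: every worker has finished
                break
            f.write(line + "\n")

def get_url(url):
    # Error handling omitted for brevity.
    soup = BeautifulSoup(requests.get(url, verify=False).text, "html.parser")
    results.put(f"{url} - {soup.title.get_text()}")

writer_thread = threading.Thread(target=writer)
writer_thread.start()

with open("urls.txt", encoding="utf-8") as f, ThreadPoolExecutor(max_workers=100) as pool:
    for line in f:
        pool.submit(get_url, line.strip())

results.put(None)                         # the executor block has exited, so all tasks are done
writer_thread.join()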

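kyriakosSt's per-file suggestion a few comments up could look roughly like the sketch below; the result_<n>.txt naming and the merge loop are illustrative assumptions:

import os
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def get_url(index, url):
    # Each task writes to its own temporary file, so nothing can interleave.
    soup = BeautifulSoup(requests.get(url, verify=False).text, "html.parser")
    with open(f"result_{index}.txt", "w", encoding="utf-8") as f:
        f.write(f"{url} - {soup.title.get_text()}\n")

with open("urls.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f]

with ThreadPoolExecutor(max_workers=100) as pool:
    for i, url in enumerate(urls):
        pool.submit(get_url, i, url)

# All tasks have finished here; merge the pieces in input order and clean up.
with open("outfile.txt", "w", encoding="utf-8") as out:
    for i in range(len(urls)):
        part = f"result_{i}.txt"
        if os.path.exists(part):          # a task may have failed and written nothing
            with open(part, encoding="utf-8") as p:
                out.write(p.read())
            os.remove(part)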
2 Answers


You can use multiprocessing.Pool and pool.imap_unordered to receive the processed results and write them to the file. That way the results are written only in the main process and won't be interleaved. For example:

import requests
import multiprocessing
from bs4 import BeautifulSoup


def get_url(url):
    # do your processing here:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.title.text


if __name__ == "__main__":
    # read urls from file or other source:
    urls = ["http://google.com", "http://yahoo.com"]

    with multiprocessing.Pool() as p, open("result.txt", "a") as f_out:
        for result in p.imap_unordered(get_url, urls):
            print(result, file=f_out)
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • Additionally, I think setting chunksize to a low value (like 1) is good. Otherwise urls are submitted to threads in chunks and you have to wait for the slowest chunk of urls to complete. – tdelaney Jul 09 '22 at 23:41
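For what it's worth, tdelaney's chunksize suggestion just means passing the argument explicitly to imap_unordered (1 is already its default, so this mostly matters if you were tempted to raise it for a very long URL list). Reusing get_url and urls from the answer above:

with multiprocessing.Pool() as p, open("result.txt", "a") as f_out:
    # chunksize=1 hands URLs to the workers one at a time, so a single slow
    # URL does not hold back a whole batch of others.
    for result in p.imap_unordered(get_url, urls, chunksize=1):
        print(result, file=f_out)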

I agree with Andrej Kesely that we should not write to the file within get_url. Here is my approach:

from concurrent.futures import ThreadPoolExecutor, as_completed

def get_url(url):
    # Processing...
    title = ...
    return url, title


if __name__ == "__main__":
    with open("urls.txt") as stream:
        urls = [line.strip() for line in stream]

    with ThreadPoolExecutor() as executor:
        urls_and_titles = executor.map(get_url, urls)

    # Exiting the with block: all tasks are done
    with open("outfile.txt", "w", encoding="utf-8") as stream:
        for url, title in urls_and_titles:
            stream.write(f"{url},{title}\n")

This approach waits until all tasks have completed before writing out the results. If we want to write out the results as soon as possible:

from concurrent.futures import ThreadPoolExecutor, as_completed

...

if __name__ == "__main__":
    with open("urls.txt") as stream:
        urls = [line.strip() for line in stream]

    with ThreadPoolExecutor() as executor, open("outfile.txt", "w", encoding="utf-8") as stream:
        futures = [
            executor.submit(get_url, url)
            for url in urls
        ]
        for future in as_completed(futures):
            url, title = future.result()
            stream.write(f"{url},{title}\n")

The as_completed() function yields the Future objects in the order they complete, so the ones that finish first are handled (and written out) first.

In conclusion, the key here is for the worker function get_url to return a value and not write to the file itself. That task is done in the main thread.

Hai Vu
  • 37,849
  • 11
  • 66
  • 93