
I'm writing a crawler that crawls pages on a website and collects links, which I write to a file. I can think of the two options below. I'm using the first method right now, which I know is not efficient since the file is opened and closed on every loop iteration, but it is safe in the sense that the data gets written to the file, so if the code crashes for some reason I'll still have it.

I'm not sure about the second method. What if the code crashes and the file isn't closed properly, will the data still have been written to the file?

Is there any other more efficient way to achieve this?

I'm only showing pseudocode.

Method 1: collect all URLs on a page, write them to the file, close the file, and repeat

def crawl(max_pages):

    # do stuff
    
    while(page <= max_pages):
        #do stuff
        with open(FILE_NAME, 'a') as f:
            f.write(profile_url + '\n')
            f.close()
            

Method 2: keep the file open, collect URLs from all pages, and close it at the very end

def crawl(max_pages):

    # do stuff
    
    with open(FILE_NAME, 'a') as f:
        while(page <= max_pages):
            #do stuff
            f.write(profile_url + '\n')
            
    f.close()

crawl(300)
Raymond
  • `f.close()` is not required in either use case as `with` does that for you. – JonSG Jan 18 '23 at 17:27
  • Does this answer your question? [How often does python flush to a file?](https://stackoverflow.com/questions/3167494/how-often-does-python-flush-to-a-file) – JonSG Jan 18 '23 at 17:30
  • Method 2 is optimal. Wrap your "#do stuff" code in try/except. Do not explicitly close the file handle when using a context manager – DarkKnight Jan 18 '23 at 17:30 (a sketch along these lines follows these comments)
  • why not use sqlite3 database? – buran Jan 18 '23 at 17:30
  • Since you mention crawling websites, it sounds like your `# do stuff` is the bulk of the execution time and the file open/close is relatively trivial. While not free, these open/write/close operations go to the operating system's file cache so are not hugely expensive. Since you have a rational reason to take that extra time, do it. – tdelaney Jan 18 '23 at 17:30
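
A minimal sketch of Method 2 along the lines suggested in the comments above, with the per-page work wrapped in try/except so one bad page doesn't stop the run. The value of FILE_NAME and the get_profile_urls helper are illustrative stand-ins for the question's "# do stuff":

FILE_NAME = 'profile_urls.txt'          # illustrative output file

def get_profile_urls(page):
    # hypothetical helper: replace with the real "# do stuff" scraping code
    return [f'https://example.com/profile/{page}']

def crawl(max_pages):
    page = 1
    with open(FILE_NAME, 'a') as f:
        while page <= max_pages:
            try:
                for profile_url in get_profile_urls(page):
                    f.write(profile_url + '\n')
            except Exception as e:
                # log and move on; `with` still closes the file if this raises
                print(f'page {page} failed: {e}')
            page += 1

crawl(300)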

2 Answers


Opening and closing a file repeatedly in a loop isn't optimal and may reduce the performance of your program. As you say, if you don't want to lose data that was already written when a crash happens, you can use the flush() method in Python.

If you write something, and it ends up in the buffer (only), and the power is cut to your machine, that data is not on disk when the machine turns off.

You can also read this answer for a better understanding of what flush and the buffer are.

How to use it?

Just change your code like this:

with open(FILE_NAME, 'a') as f:
    while(page <= max_pages):
        # do stuff
        f.write(profile_url + '\n')
        f.flush()

There's no need to call the close() method because you are using with ... as.
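
Worth noting (this goes beyond the answer above): flush() moves data from Python's internal buffer to the operating system, not necessarily to disk. If you also care about power loss rather than just a Python crash, os.fsync() asks the OS to commit the data. A minimal sketch with an illustrative file name and URL:

import os

with open('urls.txt', 'a') as f:
    f.write('https://example.com/profile1\n')
    f.flush()              # move data from Python's buffer to the OS
    os.fsync(f.fileno())   # ask the OS to commit it to disk

Calling fsync after every write is slow, so for the crash scenario described in the question, flush() alone is usually enough.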

  • I like this code, it's clean and efficient. What happens if there is a crash? Will I have the data in the file or not? Does `with open` also take care of closing the file and saving the data in the case of a crash? – Raymond Jan 18 '23 at 18:11
  • @Raymond The `with as` is a great way to ensure that your file will be closed automatically after the scope of the `with as` statement ends. Additionally, calling the `flush()` method after every `write()` call flushes the buffer, so the data is stored in the file. This technique guarantees that any writes you make are saved immediately. – Moein Arabi Jan 20 '23 at 09:02 (a quick check of this is sketched below)
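
A small self-contained check of what the comments above describe (the file name and URL are arbitrary): a Python-level exception inside the with block still closes the file, and the buffered line ends up on disk. A power cut or OS crash is a different scenario, which is where flush()/fsync() come in.

try:
    with open('demo.txt', 'a') as f:
        f.write('https://example.com/profile1\n')
        raise RuntimeError('simulated crash')
except RuntimeError:
    pass

print(f.closed)              # True: the context manager closed the file
with open('demo.txt') as f:
    print(f.read())          # the line written before the "crash" is there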

I would probably do something like this.

def crawl(max_pages):
    url_list = []
    for page in range(max_pages):
        # Code to navigate to the page
        url = '''Code to get the url you're looking for'''
        url_list.append(url)
    return url_list

def write_file(url_list, file_name):
    with open(file_name, 'a') as f:
        for url in url_list:
            f.write(url + '\n')

url_list = crawl(300)
write_file(url_list, 'output.txt')
Kaleb Fenley
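
A possible middle ground (not part of this answer as written): write in batches of, say, 20 pages, reusing write_file from above, so a crash costs at most one unwritten batch. The batch_size value and the placeholder body are illustrative.

def crawl_batched(max_pages, file_name, batch_size=20):
    batch = []
    for page in range(max_pages):
        # Code to navigate to the page and extract the url
        url = '''Code to get the url you're looking for'''
        batch.append(url)
        # flush a batch to the file every batch_size pages
        if len(batch) >= batch_size:
            write_file(batch, file_name)
            batch = []
    # write any leftover urls from the final partial batch
    if batch:
        write_file(batch, file_name)

crawl_batched(300, 'output.txt')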