
I have millions of domains to which I will send WHOIS queries, recording each WHOIS response in a .txt file.

I would like to set a maximum capacity for a single .txt output file. For example, say I start recording responses in out0.txt. I want to switch to out1.txt once out0.txt is >= 100 MB. The same goes for out1.txt: once out1.txt >= 100 MB, start writing to out2.txt, and so on.

I know that I could do a size check after each write, but I want my code to be fast: a check for every domain could slow it down. (It will asynchronously query millions of domains.)

I imagined a try-except block could solve my issue here, like this:

folder_name = "out%s.txt"
folder_number = 0


folder_name = folder_name % folder_number
f = open(folder_name, 'w+')

for domain in millions_of_domains:
   try:
      response_json = send_whois_query(domain)
      f.write(response_json)
   except FileGreaterThan100MbException:
      folder_number += 1
      folder_name = folder_name % folder_number
      f = open(folder_name, 'w+')
      f.write(response_json)

Any suggestions will be appreciated. Thank you for your time.

  • What should happen to data that doesn't fit? Should it be written to the next file as a whole? Or should the data be split? – Martijn Pieters Jun 20 '18 at 12:30
  • Note that you *can't just write multiple JSON documents to a file*, not without a delimiter. The JSONLines format uses newlines as the delimiter, and disallows newlines in the JSON documents themselves. Something to think about. – Martijn Pieters Jun 20 '18 at 12:31
  • @MartijnPieters I'm asynchronously sending queries and fetching JSON responses, which are independent of each other. Each JSON response should stay together as a whole, i.e. the data from a single response can't be split. – Ali Yılmaz Jun 20 '18 at 12:32
  • Well, good point. I can wrap them in one big JSON document, putting them together as elements of an array and keeping that array as the value of some key. – Ali Yılmaz Jun 20 '18 at 12:33
  • But then you can no longer split the file. :-) – Martijn Pieters Jun 20 '18 at 12:35
  • I don't understand why I can't split the file. Let's say I put some numbers in that array. out0.txt's JSON should look like: `{"numbers":[1,2,3]}` and out1.txt's JSON will be like `{"numbers":[4,5,6]}`. Replace the numbers with JSON responses from the WHOIS call. I can't see what I'm missing here. :) – Ali Yılmaz Jun 20 '18 at 12:37
  • Ah, I misunderstood how you were planning to split. Yes, splitting the data by creating separate documents is fine. I have too often seen people try to put everything in one big JSON array (up to and including writing `[` first, then a comma after each document, and `]` at the end). – Martijn Pieters Jun 20 '18 at 14:02
  • That said, it's [trivial to parse a JSON Lines formatted file](https://stackoverflow.com/questions/12451431/loading-and-parsing-a-json-file-with-multiple-json-objects-in-python). If those WHOIS query JSON responses contain newlines, just use `response_json.replace('\n', '')` to convert those into a JSON Lines compatible format. Newlines in JSON are always extra whitespace that can be ignored (see the sketch after these comments). – Martijn Pieters Jun 20 '18 at 14:08
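
As a concrete sketch of the JSON Lines round trip described in the comments above (the `responses` list and file name here are made up for illustration):

import json

# Stand-ins for real WHOIS responses; note the embedded newlines.
responses = ['{"domain": "example.com",\n "registrar": "Example Inc."}',
             '{"domain": "example.org",\n "registrar": "Example LLC"}']

# Write: strip newlines so each JSON document occupies exactly one line.
with open('out0.txt', 'w') as f:
    for response_json in responses:
        f.write(response_json.replace('\n', '') + '\n')

# Read back: parse one JSON document per line.
with open('out0.txt') as f:
    parsed = [json.loads(line) for line in f]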

1 Answer


You can create a wrapper object that tracks how much data has been written, and opens a new file once the limit is reached:

class MaxSizeFileWriter(object):
    def __init__(self, filenamepattern, maxdata=2**20,  # default 1 MiB
                 start=0, mode='w', *args, **kwargs):
        self._pattern = filenamepattern
        self._counter = start

        self._mode = mode
        self._args, self._kwargs = args, kwargs

        self._max = maxdata

        self._openfile = None
        self._written = 0

    def _open(self):
        if self._openfile is None:
            filename = self._pattern.format(self._counter)
            self._counter += 1
            self._openfile = open(filename, self._mode, *self._args, **self._kwargs)

    def _close(self):
        if self._openfile is not None:
            self._openfile.close()
            self._openfile = None
            self._written = 0  # start the byte count over for the next file

    def __enter__(self):
        return self

    def __exit__(self, *args, **kwargs):
        self._close()

    def write(self, data):
        if self._written + len(data) > self._max:
            # the data would push the current file past the limit;
            # close it so a fresh file is opened on the next _open() call.
            self._close()
        self._open()  # noop if already open
        self._openfile.write(data)
        self._written += len(data)

The above is a context manager, and can be used just like a regular file. Pass in a filename pattern with a `{}` placeholder where the file number will be inserted:

folder_name = "out{}.txt"

with MaxSizeFileWriter(folder_name, maxdata=100 * 2**10) as f:
    for domain in millions_of_domains:
        response_json = send_whois_query(domain)
        f.write(response_json)
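
To tie this back to the JSON Lines discussion in the comments, a small variation of the loop above (assuming each `response_json` is a single JSON document) keeps every output file parseable line by line:

with MaxSizeFileWriter("out{}.txt", maxdata=100 * 2**20) as f:
    for domain in millions_of_domains:
        response_json = send_whois_query(domain)
        # One JSON document per line; strip embedded newlines first.
        f.write(response_json.replace('\n', '') + '\n')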
Martijn Pieters