
I have a .log file that is more than 1 GB in size. I want to split it into multiple files based on file size. I have the code below to split it, but it takes a long time to process the log file and then split it.

My Code:

import subprocess
import math
import os

file_path = "path/to/file"
file_size = os.path.getsize(file_path)
MAX_SIZE = 300000000
if file_size > MAX_SIZE:
    file_lines_str = subprocess.check_output(["wc", "-l", file_path]).decode()  # e.g. "123456 path/to/file\n"
    num_of_files = math.ceil(file_size / MAX_SIZE)
    print(" Num of files", ":", num_of_files)
    file_lines = int(file_lines_str.split()[0])
    print("file_lines is", file_lines)
    file_lines_to_be_read = math.ceil(file_lines / num_of_files)
    print("file lines to be read:", file_lines_to_be_read)
    with open(file_path) as infile:
        for file_num in range(0, num_of_files):
            seek_lines = file_num * file_lines_to_be_read
            print("seek_lines", seek_lines)
            max_size_file = (file_num + 1) * file_lines_to_be_read
            print("max_size_file", max_size_file)
            output_file_name = "file_name_" + "_" + str(file_num)

            with open(output_file_name, "a") as output:
                i = seek_lines
                while i < max_size_file:
                    line = infile.readline()
                    output.write(line)
                    i = i + 1
    os.remove(file_path)

This code is inefficient in two ways:

1) I am using readline, which reads the full log file into memory. This is not a memory-efficient way.

2) I am splitting on lines, and counting the lines also takes some time.

Is there another, more optimized and efficient way of solving this problem? I am sure there must be something.

user15051990
  • If you simply want smaller files, read the big one line by line and create smaller files every X lines (see the sketch after these comments). If you need the exact sizing, use a rolling logfile appender to create smaller logs to begin with. Remove all those print statements to make it faster - printing is **SLOW** – Patrick Artner May 05 '19 at 20:43
  • https://stackoverflow.com/questions/16289859/splitting-large-text-file-into-smaller-text-files-by-line-numbers-using-python – Patrick Artner May 05 '19 at 20:59
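
A minimal sketch of the line-by-line approach Patrick Artner describes above (the chunk size, file-name prefix, and function name are illustrative, not from the question):

def split_by_lines(file_path, lines_per_file=1_000_000, prefix="chunk_"):
    # Stream the big file one line at a time and start a new output file
    # every `lines_per_file` lines; nothing beyond the current line is held in memory.
    part = 0
    line_count = 0
    output = None
    with open(file_path) as infile:
        for line in infile:
            if output is None or line_count >= lines_per_file:
                if output is not None:
                    output.close()
                output = open(prefix + str(part), "w")
                part += 1
                line_count = 0
            output.write(line)
            line_count += 1
    if output is not None:
        output.close()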

1 Answer


Python is very nice. But it's unlikely the interpreter will be able to beat /usr/bin/split -l for speed or efficiency.
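
For example, something like this hands the work to split(1) (the line count and output prefix are placeholders, not values from the question):

import subprocess

# Ask split(1) to cut the log into pieces of roughly 1,000,000 lines each,
# named path/to/file.part_aa, path/to/file.part_ab, ...
subprocess.run(["split", "-l", "1000000", "path/to/file", "path/to/file.part_"],
               check=True)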

BTW, as a practical matter, many log files have "boring" line lengths, in that they do not vary widely, they do not e.g. have twelve-character lines at start and thousand-character lines at end. If you're willing to live with such assumptions, then just "taste" the first k=100 lines, and compute sum of their lengths. Then avg_line_length = total_length / k. Obtain the file size with getsize(). Divide that by avg_line_length to get estimated number of lines in the file.

Much faster than running wc -l.
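
A rough sketch of that estimate (the function name and the default k are mine, for illustration):

import os
from itertools import islice

def estimate_line_count(file_path, k=100):
    # "Taste" the first k lines and average their lengths,
    # assuming line lengths don't vary wildly across the file.
    with open(file_path, "rb") as f:
        sample = list(islice(f, k))
    if not sample:
        return 0
    avg_line_length = sum(len(line) for line in sample) / len(sample)
    return int(os.path.getsize(file_path) / avg_line_length)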

J_H