
I'm curious if there is a faster way to deal with big text files... I need to read a fairly large TXT file (around 40 MB :-/ ) which contains data separated by ";", remove the first 10 lines as they are informative and not the main data, and then output just the first field from each line, i.e. line[0] after the split...

My current code does what it needs to, but it takes forever...

def remove_lines(input, output):
    lines = open(input).readlines()
    # truncate the output file before use
    open(output, 'w').close()
    # drop the first 10 lines, split each remaining line by ';' and write the first field
    for l in lines[10:]:
        l = l.split(';')
        open(output, 'a').write(l[0] + "\n")

It's not like I need to do it often, maybe once a week, so I can let it crunch as long as it wants, but I'm curious whether it can be sped up somehow...

VladoPortos
    Make sure you move `open(output, 'a')` _outside_ of the for loop. You're constantly reopening the file for no reason. – roganjosh Dec 08 '17 at 09:59
  • oh shoot, you are right, going to try it outside loop – VladoPortos Dec 08 '17 at 10:02
    See: https://stackoverflow.com/questions/19508703/how-to-open-a-file-through-python and others. You should really be using `with` to handle your files. – roganjosh Dec 08 '17 at 10:02

3 Answers


If you think 40 MB is huge, you haven't seen huge ;) Either way, you don't need to read the whole file into memory, nor do you need to split the whole line - it's sufficient to skip the first n lines while reading and then take each line's content up to the first semicolon, something like:

def remove_lines(input_file, output_file):
    with open(input_file, "r") as f_in, open(output_file, "a") as f_out:
        for i, line in enumerate(f_in):  # read the input line by line and enumerate it
            if i > 9:  # we're not interested in the first 10 lines
                sc_index = line.find(";")  # find the position of the first ; in line
                if sc_index != -1:  # found a semicolon - keep the content up to it
                    f_out.write(line[:sc_index] + "\n")  # write it to the output file
                else:
                    f_out.write(line)  # no semicolon, write the whole line as-is
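As a side note, the header-skipping part can also be handed to `itertools.islice`, which drops the per-line index check - an equivalent sketch of the same approach (opened with "w" here so repeated runs start from an empty output file):

```python
from itertools import islice

def remove_lines_islice(input_file, output_file):
    with open(input_file, "r") as f_in, open(output_file, "w") as f_out:
        # islice lazily skips the first 10 lines without loading the file
        for line in islice(f_in, 10, None):
            sc_index = line.find(";")  # position of the first semicolon, or -1
            # keep everything before the first ";", or the whole line if there is none
            f_out.write(line[:sc_index] + "\n" if sc_index != -1 else line)
```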

UPDATE: For the folks who think that str.split("delim", 1) is faster than finding the actual position and manually slicing, here's a simple test:

import timeit

def func_split(data):
    return data.split(";", 1)[0]

def func_find(data):
    index = data.find(";")
    if index != -1:
        return data[:index]
    return data


test1 = "A quick; brown; fox; with; semi; columns."
test2 = "A quick brown fox without semi columns."

assert func_split(test1) == func_find(test1)
assert func_split(test2) == func_find(test2)

if __name__ == "__main__":
    print("func_split:", timeit.timeit("func_split(test1); func_split(test2)",
                                       "from __main__ import func_split, test1, test2",
                                       number=1000000))
    print("func_find: ", timeit.timeit("func_find(test1); func_find(test2)",
                                       "from __main__ import func_find, test1, test2",
                                      number=1000000))

And the results:

CPython 2.7.11 x64 (1000000 loops):
('func_split:', 6.877725868989936)
('func_find: ', 6.228281754820999)

CPython 3.5.1 x64 (100000 loops):
func_split: 0.8343849130147841
func_find:  0.8080772353660183

YMMV, of course, but in general the latter will always be faster on CPython, and the speed difference will grow with each character added to the string, as str.find() doesn't have to copy the remainder of the string, nor does it need to create a list to hold the pieces.
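To see that effect, the same comparison can be run on a line with a long tail after the first `;` - exact timings depend on the machine, so none are hardcoded here:

```python
import timeit

# a line whose first field is short but whose remainder is long
long_line = "key;" + "x" * 10000

# str.split copies the (long) remainder into a list; find + slice copies only "key"
split_time = timeit.timeit(lambda: long_line.split(";", 1)[0], number=100000)
find_time = timeit.timeit(lambda: long_line[:long_line.find(";")], number=100000)

print("split:", split_time)
print("find: ", find_time)
```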

zwer
  • very nice solution as well.. in my case literally what was needed is to put open file outside the loop as @roganjosh suggested, it jumped from 30 min to 1.1s :D – VladoPortos Dec 08 '17 at 10:09
  • I'm not sure where the comma comes into this? It looks delimited by `;` and split will still give back a single element list if there is no delimiter meaning that the 0th index should always be valid? – roganjosh Dec 08 '17 at 10:09
  • There is no need to manually search for a delimiter. You can just do `line.split(';', 1)` – Eli Korvigo Dec 08 '17 at 10:10
  • @roganjosh - oops, i misread. But split will split the whole line without the need for it. – zwer Dec 08 '17 at 10:12
  • Aha, so you're short-circuiting? Fair enough. @VladoPortos you should still make note of the use of `with` in this answer; your other approach didn't close the file you were writing to (which would be handled by the context manager), and this is more pythonic in general. – roganjosh Dec 08 '17 at 10:14
  • This is how the function looks now: `def remove_lines(input, output): lines = open(input).readlines() open(output, 'w').close() file = open(output, 'a') for l in lines[10:]: l = l.split(';') file.write(l[0] + "\n") file.close()` This works super fast, and closes the file after... should I still use "with"? (going to look into that one) – VladoPortos Dec 08 '17 at 10:17
    @VladoPortos yes, you should always use `with`. – roganjosh Dec 08 '17 at 10:18
  • @EliKorvigo - there is, if the speed is what you're after and the OP was asking for speeding up his existing code. – zwer Dec 08 '17 at 10:19
  • @zwer using split with the second argument is most likely faster :) – Eli Korvigo Dec 08 '17 at 10:24

First of all, you shouldn't open your output file in a loop - that makes your OS very unhappy. Then, you only need to split up to the first instance of ";"; splitting on every instance is wasted work. There is no need to call `write` manually in your case - use the `print` function instead. And there is certainly no need to store all your data in RAM.

def remove_lines(input, output):
    with open(output, 'w') as out, open(input) as lines:
        # remove the first 10 lines
        for _ in range(10):
            next(lines)
        for l in map(str.strip, lines):
            print(l.split(";", 1)[0], file=out)
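A tiny end-to-end driver for the function above (repeated here so the snippet runs standalone; the file names are invented for the example):

```python
def remove_lines(input, output):
    with open(output, 'w') as out, open(input) as lines:
        # remove the first 10 lines
        for _ in range(10):
            next(lines)
        for l in map(str.strip, lines):
            print(l.split(";", 1)[0], file=out)

# build a sample input: 10 header lines followed by two data rows
with open("sample.txt", "w") as f:
    for i in range(10):
        f.write("header line %d\n" % i)
    f.write("alpha;1;2\nbeta;3;4\n")

remove_lines("sample.txt", "sample_out.txt")

with open("sample_out.txt") as f:
    print(f.read())  # "alpha" and "beta", one per line
```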

Jan Zeiseweis
Eli Korvigo

As in @zwer's solution, reading and processing in a loop lets I/O and CPU work be better utilized / interleaved... and it saves memory. This is more readable to my eyes:

def remove_lines(in_file, out_file):
    with open(in_file) as input, open(out_file, 'w') as output:
        # skip the 10 informative header lines
        for _ in range(10):
            input.readline()
        for line in input:
            output.write(line.split(';', 1)[0] + '\n')

With only a small amount of CPU work mixed into the I/O, there is not much more optimizing that can be done.
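If one did want to experiment further, about the only knob left on the I/O side is the buffer size passed to `open` - whether it helps at all depends on the OS and disk, so treat this as a sketch to benchmark, not a recommendation:

```python
BUF = 1024 * 1024  # 1 MiB buffers instead of the (much smaller) default

def remove_lines_buffered(in_file, out_file):
    with open(in_file, buffering=BUF) as src, open(out_file, 'w', buffering=BUF) as dst:
        # skip the 10 informative header lines
        for _ in range(10):
            src.readline()
        for line in src:
            # rstrip so lines without a ";" don't keep their own newline and gain another
            dst.write(line.rstrip('\n').split(';', 1)[0] + '\n')
```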

  • it does what I need in 1.4 s; this one, on the other hand, does it in 0.9 s: `def remove_lines(input, output): lines = open(input).readlines() open(output, 'w').close() file = open(output, 'a') for l in lines[10:]: l = l.split(';', 1) file.write(l[0] + "\n") file.close()` But it seems I'm using bad practice by not using `with`, so I need to read up on that – VladoPortos Dec 08 '17 at 10:34