
This method works just fine in Python:

with open(file) as f:
    for line in f:
        for field in line.rstrip().split('\t'):
            continue

However, it also means I scan each line's characters twice: first the file iterator loops over each character looking for newline characters, and then split loops over the same characters looking for tabs. Is there a built-in method for splitting lines that avoids looping over the same set of characters twice? Apologies if this is a stupid question.

tommy.carstensen

2 Answers


If you're worried about this level of efficiency then you probably shouldn't be programming in Python. Most of what happens in that loop happens in C (if you're using the CPython implementation): the file iterator's newline scan and str.split's tab scan are both C-level loops. You're not going to find a more efficient way to process your data with a pure-Python approach, short of a very complicated looping structure.
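If in doubt, measure. A small sketch with timeit, using hypothetical in-memory sample data (io.StringIO stands in for the file), shows how one could time the line-then-tab approach from the question:

```python
import io
import timeit

# Hypothetical sample data: 10,000 lines of 5 tab-separated fields each.
data = "\n".join("\t".join(str(i * 10 + j) for j in range(5))
                 for i in range(10_000))

def split_lines_then_tabs():
    # The approach from the question: iterate lines, then split each on tabs.
    rows = []
    for line in io.StringIO(data):
        rows.append(line.rstrip().split('\t'))
    return rows

# Time a handful of runs; the per-character scanning happens in C.
elapsed = timeit.timeit(split_lines_then_tabs, number=10)
print(f"10 runs: {elapsed:.3f} s")
```

Swapping a different parsing strategy into the timed function gives a like-for-like comparison on the same data.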

Dunes

If I wanted to avoid looping over the lines and instead handle the whole file in one go, I would use a regular expression. Regular expressions should also be really fast.

import re

regexp = re.compile(r"\n+")
with open(file) as f:
    lines = regexp.split(f.read())

Now \n+ matches one or more newlines and splits the file there. The result is a Python list of all the lines. If you want to split on other characters, for example any whitespace (spaces, tabs, and newlines), replace \n+ with \s+. Depending on what you want to do with the lines, this might not be faster; timeit is your friend.
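For instance, splitting on \s+ flattens the input into a single list of fields, which loses the line structure. A small sketch on a hypothetical in-memory string rather than a file:

```python
import re

# Hypothetical sample: two lines of three tab-separated fields each.
text = "a\tb\tc\nd\te\tf\n"

# Split on runs of any whitespace (tabs and newlines alike);
# strip first so no empty string appears at the end.
fields = re.split(r"\s+", text.strip())
print(fields)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Note that the per-line grouping is gone, so this variant only helps if you don't need to know which fields belonged to which line.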

More on Python's re module: https://docs.python.org/2/library/re.html

Daniel Karlsson