
This method works just fine in Python:

with open(file) as f:
    for line in f:
        for field in line.rstrip().split('\t'):
            continue

However, it also means I scan each line's characters twice: first the file iterator loops over each character looking for newline characters, and then split loops over the same characters looking for tabs. Is there a built-in method for splitting lines that avoids looping over the same set of characters twice? Apologies if this is a stupid question.

tommy.carstensen

2 Answers


If you're worried about this level of efficiency then you probably shouldn't be programming in Python. Most of what happens in that loop happens in C (if you're using the CPython implementation): the file iterator's newline scan and str.split's tab scan are both C-level loops. You're not going to find a more efficient way to process your data with a pure-Python approach, short of a very complicated looping structure.
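If in doubt, measure. A small sketch with timeit, using hypothetical in-memory sample data (io.StringIO stands in for the file), shows how one could time the line-then-tab approach from the question:

```python
import io
import timeit

# Hypothetical sample data: 10,000 lines of 5 tab-separated fields each.
data = "\n".join("\t".join(str(i * 10 + j) for j in range(5))
                 for i in range(10_000))

def split_lines_then_tabs():
    # The approach from the question: iterate lines, then split each on tabs.
    rows = []
    for line in io.StringIO(data):
        rows.append(line.rstrip().split('\t'))
    return rows

# Time a handful of runs; the per-character scanning happens in C.
elapsed = timeit.timeit(split_lines_then_tabs, number=10)
print(f"10 runs: {elapsed:.3f} s")
```

Swapping a different parsing strategy into the timed function gives a like-for-like comparison on the same data.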

Dunes

If I wanted to avoid looping over the lines and instead handle the whole file in one go, I would use a regular expression. Regular expressions should also be really fast.

import re

regexp = re.compile(r"\n+")
with open(file) as f:
    lines = regexp.split(f.read())

Now \n+ matches one or more newlines and splits the file there. The result is a Python list of all the lines. If you want to split on other characters, for example any whitespace (spaces, tabs, and newlines), replace \n+ with \s+. Depending on what you want to do with the lines, this might not be faster; timeit is your friend.
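For instance, splitting on \s+ flattens the input into a single list of fields, which loses the line structure. A small sketch on a hypothetical in-memory string rather than a file:

```python
import re

# Hypothetical sample: two lines of three tab-separated fields each.
text = "a\tb\tc\nd\te\tf\n"

# Split on runs of any whitespace (tabs and newlines alike);
# strip first so no empty string appears at the end.
fields = re.split(r"\s+", text.strip())
print(fields)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Note that the per-line grouping is gone, so this variant only helps if you don't need to know which fields belonged to which line.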

More on Python's re module: https://docs.python.org/2/library/re.html

Daniel Karlsson