
I have a huge text file of integers. I need to process it line by line and save the numbers in separate lists, based on some calculations on the numbers in each line.

The end goal is to load the numbers (except those on line 1) into two lists: A = numbers at odd positions, B = numbers at even positions.

File sample:

1 3 4 5 
3 4 56 73
3 4 5 6

Currently I am doing this:

line_num = 0  # must be initialized before it is incremented in the loop
with open(filename) as f:
    for line in f:
        line = line.split()
        line_num = line_num + 1
        if line_num == 1:
            # do something
            pass
        if line_num > 1:
            line = [int(i) for i in line]
            for x in range(len(line)):
                # do something
                pass

The problem is that it takes a lot of time. Is there a faster way to do this?

mdurant
  • 3
    Are you sure it's not your processing that takes a long time? How long does it take to run? How long does it take if you comment out your calculations? – Joe Oct 01 '14 at 13:34
  • I need to read line by line to process it. If I remove the calculation, it does take less time. But how do I improve the performance of reading line by line? – NEW_PYTHON_LEARNER Oct 01 '14 at 13:41
  • @NEW_PYTHON_LEARNER you are already reading the file line by line: http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python – Paolo Moretti Oct 01 '14 at 13:43
  • 2
    What is your end goal here? Don't make it a [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – Ashwini Chaudhary Oct 01 '14 at 13:43
  • The end goal is to load the numbers (except those on line 1) into two lists: A = numbers at odd positions, B = numbers at even positions – NEW_PYTHON_LEARNER Oct 01 '14 at 13:46
  • 2
    To harden @Joe's hint: I have run your code (as after my edit) over a 60 MiB file with 1M lines and it was done processing in less than 10 seconds. That's only twice the time `hexdump /dev/urandom` took to create the file. – 5gon12eder Oct 01 '14 at 13:56

2 Answers


This sounds like an efficient job for numpy:

import numpy

X = numpy.loadtxt(filename)  # pass dtype=int if you know for sure all values are integers
odds = X[1::2]
evens = X[::2]
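
Note that the slices above select every other row of the array. If "positions" means positions within each line, and the first line has to be skipped, a variant could look like the sketch below. The function name `split_by_position` is hypothetical, and the sketch assumes every line has the same number of fields (which `loadtxt` requires) and that all fields are integers.

```python
import numpy as np


def split_by_position(filename):
    """Return (a, b): numbers at odd and even positions of each line after the first."""
    # skiprows=1 drops the first line; dtype=int assumes every field is an integer.
    # ndmin=2 keeps the result two-dimensional even if only one data line remains.
    X = np.loadtxt(filename, dtype=int, skiprows=1, ndmin=2)
    a = X[:, 0::2].ravel()  # 1st, 3rd, ... number on each line
    b = X[:, 1::2].ravel()  # 2nd, 4th, ... number on each line
    return a, b
```

Both results are flat numpy arrays in file order; call `.tolist()` on them if plain Python lists are needed.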
mdurant

Instead of checking whether the line is the first line on every iteration, handle the first line before the loop. There is no need to check inside the loop.

with open(filename) as f:
    line = next(f)
    # do something for the first line

    # handle the remaining lines
    for line in f:
        line = line.split()
        line = [int(i) for i in line]
        for field in line:
            # do something with field
            pass

I removed line_num because it serves no purpose in the original code. If you need it, use enumerate:

with open(filename) as f:
    line = next(f)

    for line_num, line in enumerate(f, 2):
        ...
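
For the stated end goal (two lists, A for numbers at odd positions and B for numbers at even positions), the inner per-field loop can probably be replaced by slicing. The following is only a sketch: the function name `split_odd_even` is hypothetical, and it assumes positions are counted from 1 within each line.

```python
def split_odd_even(filename):
    """Return (a, b): numbers at odd and even positions of each line after the first."""
    a = []  # 1st, 3rd, ... number on each line
    b = []  # 2nd, 4th, ... number on each line
    with open(filename) as f:
        next(f, None)  # skip the first line; the default avoids StopIteration on an empty file
        for line in f:
            fields = [int(tok) for tok in line.split()]
            a.extend(fields[0::2])  # slicing copies every other field without a Python-level loop
            b.extend(fields[1::2])
    return a, b
```

This still reads the file line by line, so memory use for reading stays small; only the two result lists grow with the file.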
falsetru
  • @Andrey, It will not improve performance drastically. I want to let OP know the check inside the loop is not needed. – falsetru Oct 01 '14 at 13:57