I'm trying to extract lines from a very large text file (10 GB). The file contains the output of an engineering program (it's not a CSV file). I want to copy everything from line 1 up to the first line containing the string 'stop', then resume from the first line containing 'restart' and copy to the end of the file.

The following code works but it's rather slow (about a minute). Is there a better way to do it using pandas? I have tried the read_csv function but I don't have a delimiter to input.

file_to_copy = r"C:\Users\joedoe\Desktop\C ANSYS R1\PATCHED\modes.txt"
output = r"C:\Users\joedoe\Desktop\C ANSYS R1\PATCHED\modes_extract.txt"
stop = '***** EIGENVECTOR (MODE SHAPE) SOLUTION *****'
restart = '***** PARTICIPATION FACTOR CALCULATION *****  X  DIRECTION'

with open(file_to_copy) as f:
    orig = f.readlines()

newf = open(output, "w")

write = True
first_time = True
for line in orig:
    if first_time == True:
        if stop in line:
            first_time = False
            write = False
            for i in range(300):
                newf.write(
                '\n  -------------------- MIDDLE OF THE FILE -------------------')
            newf.write('\n\n')
    if restart in line: write = True
    if write: newf.write(line)
newf.close()
print('Done.')
pasei

2 Answers

`readlines` reads the whole file into a list in memory, and then you iterate over that list a second time. Iterating over the file object directly avoids building the list, so I think the following edit will save you one whole pass through the big file.

write = True
first_time = True

# Stream the input line by line instead of loading it with readlines()
with open(file_to_copy) as f, open(output, "w") as newf:
    for line in f:
        if first_time and stop in line:
            first_time = False
            write = False
            # Mark the skipped region in the output file
            for i in range(300):
                newf.write(
                    '\n  -------------------- MIDDLE OF THE FILE -------------------')
            newf.write('\n\n')
        if restart in line: write = True
        if write: newf.write(line)
print('Done.')
Charles Landau
  • Thank you good idea, I just tried and it shaves 2 seconds out of 76. Would there be a way to find the two lines I'm interested in without looping? The stop and restart strings are partial strings of a complete line. – pasei Feb 28 '19 at 17:38
  • You are iterating through the file one time in this instance. As far as I can tell any further speedup would come from speeding up the processing of the lines – Charles Landau Feb 28 '19 at 17:42
  • Thank you! Actually I just realized I should probably loop from the end of the file to find the second string, since I know it's near the end. That should save some time! – pasei Feb 28 '19 at 17:53
  • Excellent @pasei, `seek` may also help you – Charles Landau Feb 28 '19 at 17:59
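Following up on the comments about finding the two lines without a Python-level loop: one option is to memory-map the file and let `mmap.find` search for the markers. This is only a sketch under some assumptions: the function names `extract_with_mmap` and `copy_range` and the `separator` parameter are made up for illustration, the markers are passed as bytes, the file is not empty, and a 64-bit Python is used so a 10 GB mapping is possible. Whether it is actually faster on your machine would need to be measured.

```python
import mmap


def copy_range(mm, out, start, end, chunk_size=16 * 1024 * 1024):
    """Write mm[start:end] to out in chunks, avoiding one huge bytes copy."""
    pos = start
    while pos < end:
        nxt = min(pos + chunk_size, end)
        out.write(mm[pos:nxt])
        pos = nxt


def extract_with_mmap(src, dst, stop, restart, separator=b"\n"):
    """Copy src to dst, skipping from the stop line up to the restart line.

    stop and restart are bytes markers (hypothetical parameter names).
    The stop line itself is dropped and the restart line is kept,
    matching the behaviour of the loop above.
    """
    with open(src, "rb") as f, open(dst, "wb") as out:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = len(mm)
            stop_at = mm.find(stop)
            if stop_at == -1:
                # No stop marker: copy the file unchanged
                copy_range(mm, out, 0, size)
                return
            # Back up to the start of the line containing the stop marker
            head_end = mm.rfind(b"\n", 0, stop_at) + 1
            copy_range(mm, out, 0, head_end)
            out.write(separator)
            # First restart marker after the stop marker
            restart_at = mm.find(restart, stop_at)
            if restart_at != -1:
                tail_start = mm.rfind(b"\n", 0, restart_at) + 1
                copy_range(mm, out, tail_start, size)
```

With the variables from the question, a call could look like `extract_with_mmap(file_to_copy, output, stop.encode(), restart.encode())`, assuming the markers are plain ASCII. If the restart marker is known to occur only once near the end, `mm.rfind(restart)` would search backwards from the end of the file instead.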

You should use Python generators, which let you process the file lazily instead of loading it all at once. Printing inside the loop also slows the process down.

Here are a few examples of using generators:

Python generator to read large CSV file

Lazy Method for Reading Big File in Python?
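Applied to the question, a generator version might look like the sketch below. The name `kept_lines` and the `filler` parameter are hypothetical, chosen only for this example. It mainly keeps memory use flat and separates the filtering from the writing; I would not assume it is faster than a plain `for line in f` loop.

```python
def kept_lines(lines, stop, restart, filler=()):
    """Yield lines before the first stop line and from the first restart line on.

    lines is any iterable of strings (for example an open text file).
    filler is an optional iterable of lines emitted in place of the
    skipped region.
    """
    lines = iter(lines)
    for line in lines:
        if stop in line:
            break  # drop the stop line and start skipping
        yield line
    else:
        return  # stop marker never appeared, everything was yielded
    yield from filler
    for line in lines:
        if restart in line:
            yield line  # keep the restart line itself
            break
    # Pass the rest of the file through untouched
    yield from lines
```

It could then be used as `newf.writelines(kept_lines(f, stop, restart))` inside a `with open(file_to_copy) as f, open(output, "w") as newf:` block, so only one line is held in memory at a time.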

Bradia