Extracting data from a very large text file using python and pandas?

Question

I'm trying to extract lines from a very large text file (10 GB). The text file contains the output of an engineering software package (it's not a CSV file). I want to copy from line 1 to the first line containing the string 'stop', and then resume from the first line containing 'restart' to the end of the file.

The following code works but it's rather slow (about a minute). Is there a better way to do it using pandas? I have tried the read_csv function but I don't have a delimiter to input.

file_to_copy = r"C:\Users\joedoe\Desktop\C ANSYS R1\PATCHED\modes.txt"
output = r"C:\Users\joedoe\Desktop\C ANSYS R1\PATCHED\modes_extract.txt"
stop = '***** EIGENVECTOR (MODE SHAPE) SOLUTION *****'
restart = '***** PARTICIPATION FACTOR CALCULATION *****  X  DIRECTION'

with open(file_to_copy) as f:
    orig = f.readlines()

newf = open(output, "w")

write = True
first_time = True
for line in orig:
    if first_time == True:
        if stop in line:
            first_time = False
            write = False
            for i in range(300):
                newf.write(
                '\n  -------------------- MIDDLE OF THE FILE -------------------')
            newf.write('\n\n')
    if restart in line: write = True
    if write: newf.write(line)
newf.close()
print('Done.')

I thought I read on a website that read_csv is the way to go even for standard text files, but I think I misunderstood! — pasei, Feb 28 '19 at 17:29
If that is the case I would like to see the link to where it says that! — Julian Silvestri, Feb 28 '19 at 17:31
It's here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html It says: "CSV & Text files The workhorse function for reading text files (a.k.a. flat files) is read_csv(). " — pasei, Feb 28 '19 at 17:42
@pasei they meant for reading text files into pandas, not in general, I think — Charles Landau, Feb 28 '19 at 18:40

score 2 · Accepted Answer · answered Feb 28 '19 at 17:06

readlines reads the whole file into a list, and then you iterate over that list again. Iterating over the file object directly, as in the following edit, should save you one whole pass through the big file and avoids holding every line in memory at once.

# file_to_copy, output, stop and restart are defined as in the question
write = True
first_time = True

with open(file_to_copy) as f, open(output, "w") as newf:
    for line in f:  # the file object yields one line at a time
        if first_time:
            if stop in line:
                first_time = False
                write = False
                # mark the gap left by the skipped section
                for i in range(300):
                    newf.write(
                    '\n  -------------------- MIDDLE OF THE FILE -------------------')
                newf.write('\n\n')
        if restart in line: write = True
        if write: newf.write(line)
print('Done.')

answered Feb 28 '19 at 17:06

Charles Landau


Thank you, good idea. I just tried it and it shaves 2 seconds off 76. Would there be a way to find the two lines I'm interested in without looping? The stop and restart strings are partial strings of a complete line. – pasei Feb 28 '19 at 17:38
You are iterating through the file one time in this instance. As far as I can tell any further speedup would come from speeding up the processing of the lines – Charles Landau Feb 28 '19 at 17:42
Thank you! Actually I just realized I should probably loop from the end of the file to find the second string, since I know it's near the end. That should save some time! – pasei Feb 28 '19 at 17:53
Excellent @pasei, `seek` may also help you – Charles Landau Feb 28 '19 at 17:59
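
The two ideas in these comments (finding the markers without a Python-level line loop, and jumping around the file as `seek` does) can be combined by memory-mapping the file. The following is only a sketch under stated assumptions: the markers are plain ASCII, the file is not empty, and a 64-bit Python is used so that a 10 GB file can be mapped. The names `extract_with_mmap`, `_copy_range` and the `filler` parameter are hypothetical and not part of the original code. It keeps the "first restart line after the stop line" behavior of the loop above; whether it is actually faster on a given machine would need to be measured.

```python
import mmap


def _copy_range(mm, dst, start, end, chunk_size=1 << 20):
    """Write mm[start:end] to dst in chunks, avoiding one huge bytes object."""
    for pos in range(start, end, chunk_size):
        dst.write(mm[pos:min(pos + chunk_size, end)])


def extract_with_mmap(src_path, dst_path, stop, restart, filler=b""):
    """Copy src up to the first line containing `stop`, write `filler`,
    then copy from the first later line containing `restart` to the end.

    `stop`, `restart` and `filler` are bytes because the file is opened
    in binary mode (an assumption of this sketch).
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = len(mm)
            stop_pos = mm.find(stop)  # the search runs inside mmap, not in a Python loop
            if stop_pos == -1:
                # No stop marker: copy the whole file, like the original loop.
                _copy_range(mm, dst, 0, size)
                return
            # Back up to the start of the line holding the stop marker.
            head_end = mm.rfind(b"\n", 0, stop_pos) + 1
            _copy_range(mm, dst, 0, head_end)
            dst.write(filler)
            restart_pos = mm.find(restart, stop_pos)
            if restart_pos == -1:
                return  # nothing to resume from
            # Resume at the start of the line holding the restart marker.
            tail_start = mm.rfind(b"\n", 0, restart_pos) + 1
            _copy_range(mm, dst, tail_start, size)
```

If the restart marker is known to occur only once, near the end of the file, `mm.rfind(restart)` would search backwards from the end instead, which matches the idea of looping from the end of the file.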

score 0 · Answer 2 · answered Feb 28 '19 at 17:11

You should use Python generators. Also, printing makes the process slower.

The following are a few examples of using generators:

Python generator to read large CSV file

Lazy Method for Reading Big File in Python?
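
As a rough illustration of the generator approach applied to this question, here is a minimal sketch. The names `kept_lines`, `extract` and the `filler` parameter are hypothetical, chosen only for this example. Note that a file object is already a lazy line iterator, so the main gain over `readlines` is memory use rather than a guaranteed speedup.

```python
def kept_lines(lines, stop, restart, filler=""):
    """Lazily yield the lines before the first `stop` line, then every line
    from the first `restart` line onward. `filler` is emitted once at the gap."""
    writing = True
    seen_stop = False
    for line in lines:
        if not seen_stop and stop in line:
            seen_stop = True
            writing = False
            if filler:
                yield filler
        if restart in line:
            writing = True
        if writing:
            yield line


def extract(src_path, dst_path, stop, restart, filler=""):
    """Stream the kept lines from src_path to dst_path without loading the file."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(kept_lines(src, stop, restart, filler))
```

Because `kept_lines` accepts any iterable of strings, it can be tried on a short list of lines before being pointed at the real file.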

answered Feb 28 '19 at 17:11

Bradia

