I have a very large text file (50,000+ lines) whose lines should always appear in the same sequence. In Python, I want to search the file for each $INGGA line and join it with the subsequent $INHDT line, writing the result to a new text file. I need to do this without reading the whole file into memory, as that causes a crash every time. I can find and return the $INGGA line, but I'm not sure of the best memory-efficient way to get the next line and join the two into a new string.

Thanks

Phil

=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.05.06 09:11:34 =~=~=~=~=~=~=~=~=~=~=~= > $PRDID,2.15,-0.10,31.87*6E $INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50 $INHDT,31.9,T*1E $INZDA,091124.0055,06,05,2016,,*7F $INVTG,22.0,T,,M,4.4,N,8.1,K,A*24 $PRDID,2.13,-0.06,34.09*6C $INGGA,091124.20,5249.8338,N,00120.9618,W,1,20,0.6,95.0,M,49.4,M,,*5D $INHDT,34.1,T*13 $INZDA,091124.2055,06,05,2016,,*7D $INVTG,24.9,T,,M,4.4,N,8.1,K,A*2B $PRDID,2.16,-0.03,36.24*61 $INGGA,091124.40,5249.8340,N,00120.9616,W,1,20,0.6,95.0,M,49.4,M,,*5A $INHDT,36.3,T*13 $INZDA,091124.4055,06,05,2016,,*7B $INVTG,27.3,T,,M,4.4,N,8.1,K,A*22 $PRDID,2.11,-0.05,38.33*68 $INGGA,091124.60,5249.8343,N,00120.9614,W,1,20,0.6,95.1,M,49.4,M,,*58 $INHDT,38.4,T*1A $INZDA,091124.6055,06,05,2016,,*79 $INVTG,29.5,T,,M,4.4,N,8.1,K,A*2A $PRDID,2.09,-0.02,40.37*6D $INGGA,091124.80,5249.8345,N,00120.9612,W,1,20,0.6,95.1,M,49.4,M,,*56 $INHDT,40.4,T*15 $INZDA,091124.8055,06,05,2016,,*77 $INVTG,31.7,T,,M,4.4,N,8.1,K,A*21 $PRDID,2.09,0.02,42.42*40 $INGGA,091125.00,5249.8347,N,00120.9610,W,1,20,0.6,95.1,M,49.4,M,,*5F $INHDT,42.4,T*17

  • 3
    Please [edit] your question and include what you have tried. Does it work? –  Jun 14 '16 at 08:40
  • Could you elaborate on what you want to do with the `$INGGA`-`$INHDT` pairs when you have found them? Store all in another file? Store each pair in a separate file? – thorbjornwolf Jun 14 '16 at 09:22

4 Answers

You can read the file one line at a time and write the lines you need to a new file, like this:

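import re

# Open the input for reading ('rt') and the output for appending ('at')
with open('file', 'rt') as f, open('newfile', 'at') as nf:
    for line in f:
        # Look for lines that begin with $INGGA
        if re.match(r'\$INGGA', line) is not None:
            # The line that follows should be the matching $INHDT
            next_line = next(f, '')
            if next_line.startswith('$INHDT'):
                nf.write(line)
                nf.write(next_line)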

Mode 'at' opens a file for appending text, and 'rt' opens it for reading text line by line.

I ran this on a 150,000-line file and it worked well.

armatita
SmartAn

I suggest using a simple regex that captures the parts you care about. For example:

(\$INGGA.*\n\$INHDT.*\n)

https://regex101.com/r/tK1hF0/3

As in my above link, you'll notice that I used the "global" g setting on the regex, telling it to capture all groups that match. Otherwise, it'll stop after the first match.

I also had trouble determining where the actual line breaks exist in your above example file, so you can tweak the above to match exactly where the breaks occur.

Here is some starter Python example code:

import re

test_str = ""  # load your file contents here

# Raw string pattern: an $INGGA line immediately followed by an $INHDT line
p = re.compile(r'(\$INGGA.*\n\$INHDT.*\n)')
matches = re.findall(p, test_str)
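
To make it clearer what the pattern captures, here is a minimal sketch that runs it on a short in-memory sample. The sample is an assumption: a few sentences copied from the question's log and placed one per line, since the real line breaks are unclear. The names `pattern`, `sample` and `joined` are only illustrative.

```python
import re

# Same pattern as above: an $INGGA line immediately followed by an $INHDT line
pattern = re.compile(r'(\$INGGA.*\n\$INHDT.*\n)')

# Assumed layout: one NMEA sentence per line (values taken from the question's log)
sample = (
    "$PRDID,2.15,-0.10,31.87*6E\n"
    "$INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50\n"
    "$INHDT,31.9,T*1E\n"
    "$INZDA,091124.0055,06,05,2016,,*7F\n"
)

# findall already returns every non-overlapping match in Python
matches = pattern.findall(sample)

# Turn each two-line match into a single space-separated string
joined = [m.replace('\n', ' ').strip() for m in matches]
```

Each element of `joined` is then one $INGGA sentence followed by its $INHDT sentence on a single line. Note that this still needs the text in memory, so it suits small samples or chunks rather than the whole log at once.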
wilkesybear

In the example PuTTY log you give, it's all one line, separated by spaces. In that case you can replace each space with a newline and write the result to a new file:

cat large_file | sed 's/ /\n/g' > new_large_file

To iterate over the newline-separated file, run this:

cat new_large_file | python your_script.py

Your script receives the input line by line, so your computer should not crash.

your_script.py -

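import sys

INGGA_line = ""

for line in sys.stdin:
    line_striped = line.strip()
    if line_striped.startswith("$INGGA"):
        # Remember the most recent $INGGA line instead of printing it
        INGGA_line = line_striped
    elif line_striped.startswith("$INZDA"):
        # Print this line followed by the stored $INGGA line
        print(line_striped + " " + INGGA_line)
    else:
        print(line_striped)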
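
If sed is not available, the space-to-newline step could also be done in Python. The following is only a sketch, assuming the log really is one long space-separated line: it reads fixed-size chunks so the single huge line never has to fit in memory. The function name `iter_tokens` and the `chunk_size` parameter are made up for this example.

```python
import io


def iter_tokens(f, chunk_size=65536):
    """Yield whitespace-separated tokens from f, reading chunk_size characters at a time."""
    leftover = ''  # partial token carried over from the previous chunk
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf = leftover + chunk
        parts = buf.split()
        if buf[-1:].isspace():
            # The buffer ended on a separator, so every token is complete
            leftover = ''
        else:
            # The last token may continue in the next chunk
            leftover = parts.pop() if parts else ''
        for part in parts:
            yield part
    if leftover:
        yield leftover


# Small in-memory demonstration; a real log would be opened with open(...)
demo = io.StringIO("$PRDID,1 $INGGA,2 $INHDT,3\n")
tokens = list(iter_tokens(demo, chunk_size=4))
```

The tokens can then be handled like the lines read from sys.stdin above, for example by checking `startswith("$INGGA")` on each one.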
avivb

This answer is aimed at Python 3.

According to this other answer (and the docs), you can iterate your file line-by-line memory-efficiently:

with open(filename, 'r') as f:
    for line in f:
        ...  # process each line here

One way to meet the criteria in your question could be:

# Target file write-only, source file read-only
with open(targetfile, 'w') as tf, open(sourcefile, 'r') as sf:
    # Flag for whether we are looking for 1st or 2nd part
    look_for_ingga = True
    for line in sf:
        if look_for_ingga:
            if line.startswith('$INGGA,'):
                tf.write(line)
                look_for_ingga = False
        elif line.startswith('$INHDT,'):
            tf.write(line)
            look_for_ingga = True
  • In the case where you have multiple '$INGGA,' prior to the '$INHDT,', this grabs the first one and disregards the rest. In case you want to take only the last '$INGGA,' before the '$INHDT,', store the last '$INGGA,' in a variable instead of writing it to disk. Then, when you find your '$INHDT,', store both.
  • In case you meant that you want to write to a separate new file for each INGGA-INHDT pair, the target file with-statement should be nested inside for line in sf instead, or the results should be buffered in a list for later storage.
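
The first variant above (keep only the last '$INGGA,' before each '$INHDT,') could look roughly like the sketch below. It also joins each pair into one output line, as the question asks. The helper names `pair_lines` and `write_pairs` and the single-space separator are assumptions made for this example, not something given in the question.

```python
def pair_lines(lines):
    """Yield each $INHDT line joined to the most recent $INGGA line before it."""
    last_ingga = None  # most recent unpaired $INGGA line, if any
    for line in lines:
        line = line.rstrip('\r\n')
        if line.startswith('$INGGA,'):
            # A later $INGGA replaces an earlier one that was never paired
            last_ingga = line
        elif line.startswith('$INHDT,') and last_ingga is not None:
            # Separator is an assumption; change ' ' to whatever you need
            yield last_ingga + ' ' + line
            last_ingga = None  # wait for the next $INGGA


def write_pairs(sourcefile, targetfile):
    """Stream sourcefile line by line and write one joined pair per output line."""
    with open(sourcefile, 'r') as sf, open(targetfile, 'w') as tf:
        for pair in pair_lines(sf):
            tf.write(pair + '\n')
```

Because `pair_lines` is a generator fed by the file iterator, only the current line and one stored '$INGGA,' line are held in memory at any time.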

Refer to the docs for introductions to with-statements and file reading/writing.

Community
thorbjornwolf