I have a huge file (~2,000,000 lines) and I am trying to replace a few different patterns while reading the file only once.

I assumed sed is not a good fit since I have several different patterns, so I tried awk with if/else, but the file is not changed:

#!/usr/bin/awk -f
{

    if($0 ~ /data for AAA/)
    {

        sub(/^[0-9]+$/, "bla_AAA", $2)

    }
    if($0 ~ /data for BBB/)
    {

        sub(/^[0-9]+$/, "bla_BBB", $2)

    }


}

I expect the output of

address 01000 data for AAA
....
address 02000 data for BBB
....

to be

address bla_AAA data for AAA
....
address bla_BBB data for BBB
....
Yonatan Amir
  • Is there really a space before the shebang? It's a problem if there really is. If there isn't, please edit your post. – Mad Physicist May 12 '19 at 07:47
  • By "in place", do you mean modifying the file instead of creating a new one? This is not possible (unless you keep the whole file in memory) because the regex replacement would insert characters. – Michael Butscher May 12 '19 at 08:12
  • If you use Unix/Linux then no real "in place" edit is possible if the file size changes. – Cyrus May 12 '19 at 08:44
  • See: [change a few bytes in a large file without loading everything in memory using bash on linux](https://stackoverflow.com/q/16203567/3776858) – Cyrus May 12 '19 at 08:56
  • I edited the code; there is no space before the shebang. Is there any way to do it in place, maybe in Python or C++? – Yonatan Amir May 12 '19 at 09:10
  • I didn't understand what is wrong with sed -i : '/data for AAA/s/^[0-9]+$/bla_AAA/ ; /data for BBB/... ' – Eran Ben-Natan May 12 '19 at 12:44
  • @EranBen-Natan: From `info sed`: *-i: This option specifies that files are to be edited in-place. GNU 'sed' does this by creating a temporary file and sending output to this file rather than to the standard output.(1).* – Cyrus May 12 '19 at 13:08

2 Answers

I don't see any indication in your question that your file really is large: 2000000 lines is nothing, and each sample line in your question is short, so chances are this is all you need:

awk '
/data for AAA/ { $2 = "bla_AAA" }
/data for BBB/ { $2 = "bla_BBB" }
{ print }    # print every line, modified or not
' file > tmp && mv tmp file

GNU awk has a -i inplace option to do the same kind of "inplace" editing that sed, perl, etc. do (i.e. with a tmp file being used internally).
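
For illustration, the temp-file-then-rename pattern behind that kind of "inplace" editing can be sketched in Python. This is a minimal sketch under stated assumptions, not how gawk or sed implement it: the names `RULES`, `rewrite_line` and `rewrite_via_temp_file` are hypothetical, fields are assumed to be separated by single spaces, and the replacement strings are the ones from the question.

```python
import os
import re
import tempfile

# Hypothetical rule table: (pattern a line must contain, new value for field 2).
RULES = [
    (re.compile(r"data for AAA"), "bla_AAA"),
    (re.compile(r"data for BBB"), "bla_BBB"),
]


def rewrite_line(line):
    """Return the line with its second field replaced if a rule matches."""
    for pattern, replacement in RULES:
        if pattern.search(line):
            # Assumption: fields are separated by single spaces.
            fields = line.split(" ")
            if len(fields) > 1:
                fields[1] = replacement
            return " ".join(fields)
    return line


def rewrite_via_temp_file(path):
    """Stream path through rewrite_line into a temp file, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the original so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f_out, open(path) as f_in:
            for line in f_in:  # single pass, one line in memory at a time
                f_out.write(rewrite_line(line.rstrip("\n")) + "\n")
        os.replace(tmp_path, path)  # rename over the original
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

Calling `rewrite_via_temp_file("file")` reads the input once and needs free space for one extra copy of the file, the same trade-off as the awk command above.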

If you really didn't have enough storage to create a copy of the input file then you could use something like this (untested!):

headLines=10000
beg=1
tmp=$(mktemp) || exit 1
while [ -s file ]; do
    head -n "$headLines" file | awk 'above script' >> "$tmp" &&
    headBytes=$(head -n "$headLines" file |wc -c) &&
    dd if=file bs="$headBytes" skip=1 conv=notrunc of=file &&
    truncate -s "-$headBytes" file
    rslt=$?
    (( rslt == 0 )) || break    # stop on failure instead of looping forever
done
(( rslt == 0 )) && mv "$tmp" file

so you're never using up more storage than the size of your input file plus headLines lines (massage that number to suit). See https://stackoverflow.com/a/17331179/1745001 for info on what truncate and the 2 lines before it are doing.
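
What the dd and truncate steps accomplish, removing the first headBytes bytes of a file without making a second copy, might be sketched in Python as follows. This is an illustrative sketch only, not a translation of the linked answer: the name `drop_prefix` and the `block` parameter are hypothetical, and like the shell loop it leaves a damaged file if it is interrupted midway.

```python
def drop_prefix(path, n_bytes, block=1 << 20):
    """Remove the first n_bytes of path by shifting the rest forward in place."""
    with open(path, "r+b") as f:
        read_pos, write_pos = n_bytes, 0
        while True:
            f.seek(read_pos)
            chunk = f.read(block)
            if not chunk:
                break
            # The write position always trails the read position by n_bytes,
            # so unread data is never overwritten.
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
        # Cut off the stale tail left behind by the shift.
        f.truncate(write_pos)
```

Only one `block`-sized chunk is held in memory at a time, so the extra storage needed is independent of the file size.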

Ed Morton
Something like this:

(Read a line, do the text manipulation, write the modified data to output file)

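with open('in.txt') as f_in, open('out.txt', 'w') as f_out:
    for line in f_in:
        fields = line.rstrip('\n').split(' ')
        # Only rewrite lines shaped like "address <number> data for <name>";
        # everything else (e.g. the "...." lines) is copied unchanged.
        if len(fields) >= 5 and fields[2:4] == ['data', 'for']:
            fields[1] = 'bla_{}'.format(fields[4])
        f_out.write(' '.join(fields) + '\n')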
balderman