linux sed grep -P replace string with newline and taking next line into consideration

Question

I have a generated file and I need to remove the last "," so that it becomes valid JSON. The problem is that I can't figure out how to do it with sed, or even with grep piped into something else. I am really stumped here. Any help would be appreciated.

test.json

[
{MANY OTHER RECORDS, MAKING FILE 3.5Gig (making sed fail because of memory, so newlines were added)},
{"ID":"57705e4a-158c-4d4e-9e07-94892acd98aa","USERNAME":"jmael","LOGINTIMESTAMP":"2021-11-30"},
{"ID":"b8b67609-50ed-4cdc-bbb4-622c7e6a8cd2","USERNAME":"henrydo","LOGINTIMESTAMP":"2021-12-15"},
{"ID":"a44973d0-0ec1-4252-b9e6-2fd7566c6f7d","USERNAME":"null","LOGINTIMESTAMP":"2021-10-31"},
]

Of course, using grep with -P matches what I need to replace:

grep -Pzo '"},\n]' test.json
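
Since only the end of the file matters, the same check can be limited to the last few bytes instead of scanning all 3.5 GB. A minimal sketch, assuming the trailing comma sits within the last 64 bytes; `has_trailing_comma` and the 64-byte default are illustrative and not from the original post, and the PCRE pattern is swapped for a whitespace-insensitive `tr` plus plain `grep`:

```shell
# Hypothetical helper: succeeds if the file ends with ",]" once whitespace is ignored.
# $1 = file, $2 = number of trailing bytes to inspect (assumed default: 64).
has_trailing_comma() {
    tail -c "${2:-64}" "$1" | tr -d ' \t\r\n' | grep -q ',]$'
}
```

It could be used as `has_trailing_comma test.json && echo "needs fixing"`.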

You should emphasize **_"sed fail because of memory, so newlines were added"_**. — Fravadona, Dec 05 '22 at 18:25

score 1 · Answer 1 · answered Dec 05 '22 at 10:38

Using GNU sed

$ sed -Ez 's/([^]]*),/\1/' test.json
[
{MANY OTHER RECORDS, MAKING FILE 3.5Gig (making sed fail because of memory, so newlines were added)},
{"ID":"57705e4a-158c-4d4e-9e07-94892acd98aa","USERNAME":"jmael","LOGINTIMESTAMP":"2021-11-30"},
{"ID":"b8b67609-50ed-4cdc-bbb4-622c7e6a8cd2","USERNAME":"henrydo","LOGINTIMESTAMP":"2021-12-15"},
{"ID":"a44973d0-0ec1-4252-b9e6-2fd7566c6f7d","USERNAME":"null","LOGINTIMESTAMP":"2021-10-31"}
]

score 1 · Answer 2 · answered Dec 05 '22 at 10:50

Remove last comma in a file with GNU sed:

sed -zE 's/,([^,]*)$/\1/' file

Output to stdout:

[
{MANY OTHER RECORDS, MAKING FILE 3.5Gig (making sed fail because of memory, so newlines were added)},
{"ID":"57705e4a-158c-4d4e-9e07-94892acd98aa","USERNAME":"jmael","LOGINTIMESTAMP":"2021-11-30"},
{"ID":"b8b67609-50ed-4cdc-bbb4-622c7e6a8cd2","USERNAME":"henrydo","LOGINTIMESTAMP":"2021-12-15"},
{"ID":"a44973d0-0ec1-4252-b9e6-2fd7566c6f7d","USERNAME":"null","LOGINTIMESTAMP":"2021-10-31"}
]

See: man sed and The Stack Overflow Regular Expressions FAQ
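
Both `sed -z` commands above read the whole file into memory as a single record, which is what failed on the 3.5 GB file. A streaming variant that keeps only two lines in the pattern space can be sketched as follows, assuming the one-record-per-line layout from the question with `]` alone on the last line. `strip_last_comma` is a hypothetical name, and this is a different technique (an `N`/`P`/`D` sliding window), not what either answer uses:

```shell
# Hypothetical helper: print the file with the comma before the closing "]" removed.
# $!N  append the next line (except on the last line)
# s    drop a comma that is directly followed by a newline and "]"
# P    print the first line of the two-line window
# D    delete that first line and restart the cycle with the remainder
strip_last_comma() {
    sed '$!N; s/,\(\n]\)/\1/; P; D' "$1"
}
```

Called as `strip_last_comma test.json > fixed.json`, it still reads the file once in full, but memory use stays at about two lines.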

Fravadona · Accepted Answer · 2022-12-09T14:47:09.497

An efficient solution is to use perl to read the last n bytes of the file, find the position of the superfluous comma in those bytes (for example with a regex), and then overwrite that comma with a space character:

perl -e '
    $n = 16;                                # how many bytes to read
    open $fh, "+<", $ARGV[0]                # open file in read & write mode
        or die "cannot open $ARGV[0]: $!";  # stop if the file cannot be opened
    seek $fh, -$n, 2;                       # go to the end minus some bytes
    $n = read $fh, $str, $n;                # load the end of the file
    if ( $str =~ /,\s*]\s*$/s ) {           # get position of comma
        seek $fh, -($n - $-[0]), 1;         # go to position of comma
        print $fh " ";                      # replace comma with space char
    }
    close $fh;                              # close file
' test.json

The strong point of this solution is that it reads only a few bytes of the file to do the replacement. That keeps memory consumption close to zero and avoids reading through the whole file.
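
The same seek-and-overwrite idea can be sketched in plain shell with `stat`, `tail` and `dd`. This is an approximation under stated assumptions, not the code from this answer: `fix_trailing_comma` is a hypothetical helper, the 64-byte window is arbitrary, `stat -c %s` is the GNU form, and the file is assumed to end with a comma followed only by whitespace and `]`:

```shell
# Hypothetical helper: overwrite the comma before the final "]" with a space, in place.
# $1 = file, $2 = number of trailing bytes to inspect (assumed default: 64).
# The body runs in a subshell, so the variables and LC_ALL do not leak to the caller.
fix_trailing_comma() (
    LC_ALL=C; export LC_ALL            # count string lengths in bytes
    file=$1
    n=${2:-64}

    size=$(stat -c %s "$file") || exit 1
    if [ "$n" -gt "$size" ]; then n=$size; fi

    # Only the last n bytes are read; $(...) drops trailing newlines,
    # which does not shift offsets counted from the start of the window.
    tail_bytes=$(tail -c "$n" "$file")

    case $tail_bytes in
        *,*) ;;                        # there is a comma in the window
        *)   exit 0 ;;                 # nothing to fix
    esac

    rest=${tail_bytes##*,}             # everything after the last comma
    # Only act if the last comma is followed by nothing but whitespace and "]".
    [ "$(printf '%s' "$rest" | tr -d ' \t\r\n')" = "]" ] || exit 0

    offset=$(( ${#tail_bytes} - ${#rest} - 1 ))   # comma position inside the window
    pos=$(( size - n + offset ))                  # comma position inside the file

    # Write one space at byte offset pos without truncating the file.
    printf ' ' | dd of="$file" bs=1 seek="$pos" conv=notrunc 2>/dev/null
)
```

It would be called as `fix_trailing_comma test.json`. As with the perl version, the file size does not change, because the comma is replaced by a space rather than deleted.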

I should have tried this first! I just did and it works flawlessly and super fast. — user3008410, Dec 07 '22 at 12:49

score 1 · Answer 4 · answered Dec 06 '22 at 18:01

Below is the final solution I used for this. It's not the prettiest, but it works with no memory issues and it does what I need. Thanks to Cyrus for helping. Hope this helps someone out.

find *.json | while IFS= read -r file; do

  _FILESIZE=$(stat -c%s "$file")

  if [[ $_FILESIZE -gt 2050000000 ]]; then

    echo "${file} is too large = ${_FILESIZE} bytes. It will be split to work on."

    # Get the name of the file without its extension
    _FILENAME="${file%.*}"

    # Split the large file: 3-digit numeric suffix, 1G pieces, no zero-byte files
    split -a 3 -e -d -b 1G "${file}" "${_FILENAME}_"

    # A pipe runs the loop in a new shell, so read from a process substitution
    # to keep the variable set after the loop.
    _FINAL_FILE_NAME_SPLIT=
    while IFS= read -r file_split; do
      _FINAL_FILE_NAME_SPLIT=${file_split}
    done < <(find "${_FILENAME}"_[0-9][0-9][0-9] | sort)

    # The last piece has the change we need to make: "null"}, \n ] becomes "null"} \n ]
    # (this assumes the trailing "}," falls inside the last piece)
    sed -i -zE 's/},([^,]*)$/}\1/' "${_FINAL_FILE_NAME_SPLIT}"

    # Rebuild the original file from the split pieces
    cat "${_FILENAME}"_[0-9][0-9][0-9] > "${file}"

    # Remove only this file's split pieces
    rm -- "${_FILENAME}"_[0-9][0-9][0-9]

  else

    sed -i -zE 's/},([^,]*)$/}\1/' "${file}"

  fi

  # Check that the file is a valid JSON file
  jq 'length' "${file}"

  # View the change
  tail -c 50 "${file}"

  echo " "
  echo " "

done
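
The last split piece can also be found without `find`, `sort`, or a process substitution: a glob expands in sorted order, and a `for` loop runs in the current shell, so nothing is lost to a subshell. A minimal sketch; `last_piece` is a hypothetical helper, and the three-digit suffix is assumed to match the `split -a 3 -d` call in the script above:

```shell
# Hypothetical helper: print the last existing file named PREFIX_NNN (NNN = three digits).
# $1 = prefix, for example the file name without its extension.
last_piece() {
    last=
    for piece in "$1"_[0-9][0-9][0-9]; do
        # When nothing matches, the loop sees the unexpanded pattern; skip it.
        [ -e "$piece" ] && last=$piece
    done
    printf '%s\n' "$last"
}
```

In the script this could stand in for the `while read` loop, for example `_FINAL_FILE_NAME_SPLIT=$(last_piece "${_FILENAME}")`.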

Nice trick; splitting the file for limiting the memory consumption of GNU `sed -z` . You should try the perl solution though, it's instantaneous even if the file is 500GB big. +1 for working up something by yourself ;-) — Fravadona, Dec 06 '22 at 23:29
