
I'm trying to insert a line feed wherever a text string matches in large files on UNIX. For files of about 1 GB or less, it works. For anything larger, the replacement does not complete and the command appears to do nothing.

I'm using the following command:

sed -i 's/"sourceSystemCode": "xyz"}{"active": true/"sourceSystemCode": "xyz"}\n{"active": true/g' filename.txt

I have even tried:

sed 's/"sourceSystemCode": "xyz"}{"active": true/"sourceSystemCode": "xyz"}\n{"active": true/g' filename.txt > newfile.txt

Any suggestions for adding the line feed with another command, or with different sed syntax, are greatly appreciated.

Wiktor Stribiżew
TimBurke
  • How big is the biggest file? How long does it take the command to run for 1 GB file? – Timur Shtatland Jun 05 '23 at 17:52
  • The file sizes vary. I ran it on a 1.01 GB file and it finished in less than 1 minute. Other files were 2.92 GB, 3.17 GB and 4.1 GB. When I ran the command against those, it ran for 1-2 minutes but the files were not modified. – TimBurke Jun 05 '23 at 18:09

2 Answers


Use this Perl one-liner, which accomplishes what you need:

perl -i -pe 's/"sourceSystemCode": "xyz"[}][{]"active": true/"sourceSystemCode": "xyz"}\n{"active": true/g' filename.txt

Note that the curly braces { and } have special meaning to the regex engine, so you need to escape them as \{ and \}, or put them into character classes as [{] and [}]. If you run the substitution as you have shown it, without escaping or character classes, it may not replace what you think it does (in Perl, at least).

This takes about 5 minutes for a 40 GB file on an M1 MacBook Pro.

Timur Shtatland
  • I got an error: "Out of memory!" How can I resolve this? Is it an issue with my laptop environment or with the UNIX environment? I tried it on a smaller file and it does work. Thanks! – TimBurke Jun 05 '23 at 18:46
  • Maybe the entire file is a single long line? See: https://stackoverflow.com/q/44811130/967621 – Timur Shtatland Jun 05 '23 at 19:03
  • Yes, it's a JSON file that was output as a single string. I'm trying to put line feeds between the messages. – TimBurke Jun 05 '23 at 19:59

If the string occurs frequently enough that each record between matches fits in memory, you can use GNU awk:

gawk \
    -v  RS='"sourceSystemCode": "xyz"}{"active": true' \
    -v ors='"sourceSystemCode": "xyz"}\n{"active": true' \
    -v ORS= \
'
    NR>1 { print ors }
    1;
' filename.txt > filename.txt.new &&
mv filename.txt.new filename.txt

Note that RS is treated as a regular expression here, not a simple string: gawk does this whenever RS is longer than one character.

jhnc