
I have a file that is much larger than the amount of memory available on the server that needs to run this script.

In that file, I need to run a basic regex which does a find and replace across two lines at a time. I've looked at using sed, awk, and perl, but I haven't been able to get any of them to work as I need it in this instance.

On a smaller file, the following line does what I need it to: perl -0777 -i -pe 's/,\s+\)/\n\)/g' inputfile.txt

In essence, any time a line ends in a comma and the next line starts in a closing parenthesis, remove the comma.
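
For illustration, with made-up data (not the real file), a fragment like

foo(
    bar,
    baz,
)

should come out as

foo(
    bar,
    baz
)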

When I tried to run that on my production file I just got the message "Killed" in the terminal after a couple of minutes and the file contents were completely erased. I was watching memory usage during that and as expected it was running at 100% and using the swap space extensively.

Is there a way to make that perl command run on two lines at a time instead, or an alternative bash command which might achieve the same result?

If it makes it easier to keep the file size identical, I also have the option of replacing the comma with a space character instead of removing it.

bdx
  • I am not sure of the exact problem statement in all that -- So: _if a line ends in comma and the next one starts in `)`_ then remove the comma, otherwise do nothing. Is that a full statement of the problem? – zdim Sep 26 '19 at 03:05
  • Yes, that's correct. – bdx Sep 26 '19 at 03:06
  • Can the lines with an open paren also end in a comma? – Shawn Sep 26 '19 at 03:18
  • Lines containing an opening parenthesis and/or closing parenthesis can also end in a comma. – bdx Sep 26 '19 at 03:20
  • Just want to understand: So you are only interested in instances of `,\n)` in the file, and you want to convert all of those to `\n)`? – DavidO Sep 26 '19 at 03:54

5 Answers


A fairly direct logic:

  • print a line unless it ends with a comma (we need to check the next line first, and perhaps remove the comma)

  • print the previous line ($p) if it ended with a comma, dropping the comma if the current line starts with )

perl -ne'
    if ($p =~ /,$/) { $p =~ s/,$// if /^\s*\)/; print $p }; 
    print unless /,$/; 
    $p = $_
' file

The efficiency of this can be improved somewhat, by dropping one regex (and with it that engine startup overhead) and some data copying, but at the expense of clumsier code with additional logic and checks.
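
For example, one way to shave a regex match off each line (a sketch, not necessarily the exact improvement meant above) is to remember the comma test in a flag instead of re-matching $p; it has the same last-line caveat addressed below:

perl -ne'
    if ($c) { $p =~ s/,$// if /^\s*\)/; print $p };
    $c = /,$/;
    print unless $c;
    $p = $_
' file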

Tested with file

hello
here's a comma,
which was fine
(but here's another,
) which has to go,
and that was another good one.
end

The above fails to print the last line if it ends in a comma. One fix for that is to check our buffer (the previous line, $p) in an END block, so add at the end:

END { print $p if $p =~ /,$/}

This is a fairly usual way to check for trailing buffers or conditions in -n/-p one-liners.
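
Assembled, that is just the code above with the END block appended:

perl -ne'
    if ($p =~ /,$/) { $p =~ s/,$// if /^\s*\)/; print $p };
    print unless /,$/;
    $p = $_;
    END { print $p if $p =~ /,$/ }
' file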

Another fix, less efficient but with perhaps cleaner code, is to replace the statement

print unless /,$/;

with

print if (not /,$/ or eof);

This does run an eof check on every line of the file, while END runs once.
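
For completeness, the assembled one-liner with that eof variant swapped in reads:

perl -ne'
    if ($p =~ /,$/) { $p =~ s/,$// if /^\s*\)/; print $p };
    print if (not /,$/ or eof);
    $p = $_
' file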

zdim

If using \n newline as a record separator is awkward, use something else. In this case you're specifically interested in the sequence ,\n), and we can let Perl find that for us as we read the file:

perl -pe 'BEGIN{ $/ = ",\n)" } s/,\n\)/\n)/' input.txt >output.txt

This portion: $/ = ",\n)" tells Perl that instead of iterating over lines of the file, it should iterate over records that terminate with the sequence ,\n). That helps us ensure that every chunk will contain at most one such sequence, and more importantly, that the sequence will not span chunks (or records, or file-reads). Every chunk read will either end in ,\n) or, in the case of the final record, may have no record terminator at all (by our definition of terminator).

Next we just use substitution to eliminate that comma in our ,\n) record separator sequence.

The key here really is that by setting the record separator to the very sequence we're interested in, we guarantee the sequence will not get broken across file-reads.

As has been mentioned in the comments, this solution is only useful if the span between ,\n) sequences doesn't exceed the amount of memory you are willing to throw at the problem. Newlines almost certainly occur in the file more often than ,\n) sequences do, so this will read larger chunks than a line-by-line approach would. You know your data set better than we do, and so are in a better position to judge whether the simplicity of this solution is outweighed by the memory footprint it consumes.
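
If you want a rough idea of how large those records get before committing to this, a quick check along these lines (just a sketch; input.txt stands in for your real file) reports the longest record the ,\n) separator would produce. Note that the check itself reads one record at a time, so it has the same worst-case memory behaviour as the solution:

perl -ne'BEGIN { $/ = ",\n)" } $max = length if length > $max; END { print "longest record: $max bytes\n" }' input.txt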

DavidO
  • Nice idea but may end up reading _huge_ chunks (or even whole file) ... – zdim Sep 26 '19 at 04:34
  • Yes, it does make that assumption. But on the other hand, `\n` as a record separator makes assumptions about line length. – DavidO Sep 26 '19 at 04:35
  • Again, I agree that this will read records as large as the span between occurrences of `,\n)`. I'll mention that in the answer. However, although we're told in the OP's question that the file is large, we're not told whether the trigger sequence is sparse. If it is, yes, this could be problematic the same way that a file with huge lines could be problematic. We are most likely reading larger chunks than we would if we split only on newline. – DavidO Sep 26 '19 at 04:37
  • For me the only problem is unpredictability: what if the file has no such sequence (or very few) and the program gets killed? – zdim Sep 26 '19 at 04:41

Delay printing out the trailing comma and line feed until you know it's ok to print it out.

perl -ne'
   $_ = $buf . $_;                # prepend the ",\n" held back from the previous line, if any
   s/^,(?=\n\))//;                # drop that comma if the current line starts with ")"
   $buf = s/(,\n)\z// ? $1 : "";  # hold back a trailing ",\n" until we see the next line
   print;
   END { print $buf; }            # at EOF, flush anything still held back
'

Faster:

perl -ne'
   print /^\)/ ? "\n" : ",\n" if $f;  # restore the held-back ",\n", minus the comma if this line starts with ")"
   $f = s/,\n//;                      # strip a trailing ",\n" and remember (in $f) that we did
   print;
   END { print ",\n" if $f; }         # at EOF, print the still-pending ",\n" if there is one
'

See also: Specifying file to process to Perl one-liner

ikegami

This can be done more simply with just awk.

awk 'BEGIN{RS=".\n."; ORS=""} {gsub(",\n)", "\n)", RT); print $0 RT}'

Explanation:

awk, unlike Perl, allows a regular expression as the Record Separator, here .\n. which "captures" the two characters surrounding each newline.

Setting ORS to empty prevents print from outputting extra newlines. Newlines are all captured in RS/RT.

RT represents the actual text matched by the RS regular expression.

The gsub removes any desired comma from RT if present.

Caveat: You'd need GNU awk (gawk) for this to work. According to the gawk man page, POSIX-only awk lacks the regexp-RS and RT variable features.

Note: gsub is not really needed; sub is good enough and probably should have been used above.
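
That variant is simply the same command with sub in place of gsub:

awk 'BEGIN{RS=".\n."; ORS=""} {sub(",\n)", "\n)", RT); print $0 RT}'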

Jeff Y

This might work for you (GNU sed):

sed 'N;s/,\n)/\n)/;P;D' file

Keep a moving window of two lines throughout the file; if the first line ends in a comma and the second begins with a closing parenthesis, remove the comma.
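
The same command written out with comments (GNU sed allows comments on lines of their own):

sed '
  # N: append the next input line to the pattern space, giving a two-line window
  N
  # if the first line ends in a comma and the second starts with ), drop the comma
  s/,\n)/\n)/
  # P: print up to the first newline, i.e. the first line of the window
  P
  # D: delete up to the first newline and restart the cycle with what remains
  D
' file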

If there is white space and it needs to be preserved, use:

sed 'N;s/,\(\s*\n\s*)\)/\1/;P;D' file
potong