Splitting very long (4GB) string with new lines

Question

I have a file that is supposed to be JSON objects, one per line. Unfortunately, a miscommunication happened with the creation of the file, and the JSON objects only have a space between them, not a newline.

I need to fix this by replacing every instance of } { with }\n{.

Should be easy for sed or Perl, right?

sed -e "s/}\s{/}\n{/g" file.in > file.out

perl -pe "s/}\s{/}\n{/g" file.in > file.out

But file.in is actually 4.4 GB, which seems to be causing a problem for both of these solutions.

The sed command finishes with a halfway-correct file, but file.out is only 335 MB and contains only about the first 1/10th of the input file, cutting off in the middle of a line. It's almost like sed just quit in the middle of the stream. Maybe it's trying to load the entire 4.4 GB file into memory, running out of stack space at around 300 MB, and silently killing itself.

The Perl command errors with the following message:

[1] 2904 segmentation fault perl -pe "s/}\s{/}\n{/g" file.in > file.out

What else should I try?

You could check the answers to this question: https://stackoverflow.com/questions/6951687/find-and-replace-text-in-a-47gb-large-file — N Shumway, Jun 28 '17 at 19:42
The proper solution is to get the originator to create valid data. Why are you writing code to correct someone else's mistake? What would happen if the error couldn't be corrected at your end? How can such a "miscommunication" happen in the first place, and why does your company need Stack Overflow to fix their mistake? This is disgraceful at all levels, and management should not be getting you to fix errors like this. — Borodin, Jun 28 '17 at 22:18
Of course sed is trying to read the whole file into memory, sed reads one line at a time and your file contains one line. — Ed Morton, Jun 28 '17 at 22:42

ikegami · Answer 1 · 2017-06-28T20:38:50.787

Unlike the earlier solutions, this one handles a } { that appears inside a string literal, e.g. {"x":"} {"}.

use strict;
use warnings;
use feature qw( say );

use JSON::XS qw( );

use constant READ_SIZE => 64*1024*1024;

my $j_in = JSON::XS->new->utf8;
my $j_out = JSON::XS->new;

binmode STDIN;
binmode STDOUT, ':encoding(UTF-8)';

while (1) {
   my $rv = sysread(\*STDIN, my $block, READ_SIZE);
   die($!) if !defined($rv);
   last if !$rv;

   $j_in->incr_parse($block);

   while (my $o = $j_in->incr_parse()) {
      say $j_out->encode($o);
   }
}

die("Bad data") if $j_in->incr_text !~ /^\s*\z/;

Nahuel Fouilleul · Answer 2 · 2017-06-28T20:10:35.473

perl -ple 'BEGIN{$/=qq/} {/;$\=qq/}\n{/}undef$\ if eof' <input >output

edited Jun 28 '17 at 20:10

answered Jun 28 '17 at 19:49 by Nahuel Fouilleul

So, what are `$/` and `$\ `? – stevesliva Jun 28 '17 at 19:56
`perlvar`: `$/` is the input record separator and `$\ ` is the output record separator. The problem with the command is that one more "}\n{" is added at the end because of the `-l` option – Nahuel Fouilleul Jun 28 '17 at 19:59
Thanks, I thought of `undef $\ if eof`; the extra separator was due not to the `-l` option but to print – Nahuel Fouilleul Jun 28 '17 at 20:10
@stevesliva: *"So, what are `$/` and `$\ `?"* This is an answer to a Perl question. If you don't know the language then you need to ask a separate Stack Overflow question. Good luck for not getting a RTFM response. – Borodin Jun 28 '17 at 22:23
@Borodin I'd delete the question if the answer were modified, but asking for the keystone of this answer to be explained wasn't because I didn't know where to find perlvar... but because it's simply not useful without the string "record separator." – stevesliva Jun 28 '17 at 23:41
@stevesliva: As you said, the answer needs some explanation. I agree. But there's a better way to say that and you don't have to resort to name calling. – Borodin Jun 28 '17 at 23:51

score 1 · Accepted Answer · answered Jun 28 '17 at 19:50

The default input record separator in Perl is \n, but you can change it to any character you want. For this problem, you could use { (octal 173).

perl -0173 -pe 's/}\s{/}\n{/g' file.in > file.out

answered Jun 28 '17 at 19:50 by mob

That will fail when your input contains `{`s in other contexts. – Ed Morton Jun 28 '17 at 22:44
@mob: I think Ed's concern is unfounded. As long as no occurrence of `} {` can be split across records they should all be substituted correctly. All I would change is `s/\}\s*\{/}\n{/g` which escapes the braces in the pattern and allows for zero or more white space characters between them. – Borodin Jun 28 '17 at 23:19
@ikegami no, barring writing a JSON parser I'm not worried about `} {` appearing elsewhere and am happy to take the OP's word for it that every instance of that string should be operated on. I was actually thinking of `{` alone inside a comment or `{` in the context of nested values or something else. I'm not familiar enough with JSON to say what that something else might be, it's just that **usually** when trying to do something with a string if you decide to do something else with part of that string then it leads to problems. Borodin may be right that in this specific case it's not an issue. – Ed Morton Jun 29 '17 at 16:48
@Ed Morton, mob's code doesn't replace `{`; it replaces `} {`. The only place `} {` can appear in [JSON](http://www.json.org/) is in string literals (e.g. `{"x":"} {"}`). (JSON doesn't have comments.) – ikegami Jun 29 '17 at 16:59
Yes, I understand that but he said he's using `{` alone to split the input into records and doing something like that (using part of the target string instead of the whole of it in some context) is what often causes problems where the solution will work for the posted sample and then fail later with different/real input. Most commonly we see people wanting to identify text between `{foo}` strings and they write regexps like `/{foo}[^{]*/'` since they can't figure out how to negate `{foo}`. Again, idk if it's really a problem in this case. I meant string, not comment, sorry. – Ed Morton Jun 29 '17 at 19:25

score 0 · Answer 4 · edited Jun 29 '17 at 16:44

You may read input in blocks/chunks and process them one by one.

use strict;
use warnings;

binmode(STDIN);
binmode(STDOUT);
my $CHUNK=0x2000; # 8 KiB
my $buffer = '';

while( sysread(STDIN, $buffer, $CHUNK, length($buffer))) {
  $buffer =~ s/\}\s\{/}\n{/sg;
  if( length($buffer) > $CHUNK) { # More than one chunk buffered
    syswrite( STDOUT, $buffer, $CHUNK); # write  FIRST of buffered chunks
    substr($buffer,0,$CHUNK,''); # remove FIRST of buffered chunks from buffer
  }
}
syswrite( STDOUT, $buffer) if length($buffer);

edited Jun 29 '17 at 16:44 by ikegami

answered Jun 28 '17 at 19:59 by AnFi

@ikegami My original version worked on two chunks in the buffer to handle a match over a chunk boundary. Why did you select such a big chunk size? Anyway, "read in chunks" is overkill for a "use **once**" script IMHO. – AnFi Jun 28 '17 at 20:31
@AndrzejA.Filip: I have rolled your solution back to before the major edit, but you must understand that that reintroduces the bug whereby your code will not work if the buffer ends in the middle of a `} {` boundary. The editor has posted their own solution, so I don't see any sense in them hijacking yours as well. But ***please*** address the problem, at least by annotating your code, and ideally by fixing it so that it works properly. – Borodin Jun 28 '17 at 22:49
@borodin 1) I have added comments indicating why my (original) code should handle the "pattern across chunk boundary" situation: it keeps TWO chunks buffered. A "double rewrite" does no harm with this pattern. 2) **ikegami**'s "fundamental rewrite" is better suited for handling short patterns. I have learned something :-) – AnFi Jun 28 '17 at 23:25
@Borodin `0x2000` is a hex number - `perl -e 'print 0x2000'` prints 8192. – AnFi Jun 29 '17 at 00:03
You're right of course. I'm sorry for such a silly mistake. Even so, I would prefer to see `$CHUNK = 1024 * 8` which would be evaluated at compile time and makes the comment unnecessary. – Borodin Jun 29 '17 at 15:48

Ed Morton · Answer 5 · 2017-06-28T22:46:27.153

Assuming your input doesn't contain } { pairs in other contexts that you do not want replaced, all you need is:

awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'

e.g.

$ printf '{foo} {bar}' | awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
{foo}
{bar}

The above uses GNU awk for multi-char RS and RT and will work on any size input file as it does not read the whole file into memory at one time, just each } {-separated "line" one at a time.
