I have the following awk command to join lines that are shorter than a given limit (it is basically used to rejoin broken lines in a multiline fixed-width file):

awk 'last{$0=last $0;} length($0)<21{last=$0" ";next} {print;last=""}' input_file.txt > output_file.txt

input_file.txt:
1,11,"dummy
111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333

output_file.txt (expected):
1,11,"dummy 111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333

The script works well with small files (~MB), but it has no effect on big files (~GB).

What could be the problem?

Thanks in advance.

d_c
  • Can you debug with `awk 'NR==1000 {exit} last{$0=last $0; print "Joined line: " $0} length($0)<21{last=$0" "; print "Joining " last; next} {last=""}' input_file.txt`? – Walter A Dec 24 '21 at 15:00
  • Can you expand on what you mean by `does nothing`? Nothing is output? The output is just a copy of the input (i.e., no changes made)? `awk` hangs and never comes back? Something else? – markp-fuso Dec 24 '21 at 15:08
  • Can you check what happens when you run the awk script on both your input_file.txt and a big_file.txt, with `awk your_script input_file.txt big_file.txt | head -5`? – Walter A Dec 24 '21 at 15:29
  • By `does nothing` I mean the output_file.txt content is the same as the input_file.txt content (lines are not joined). – d_c Dec 24 '21 at 15:47

1 Answer

Best guess: none of the lines in your big file is shorter than 21 chars, so the join condition never fires. There are more robust ways to do what you're trying to do with that script, though, so it may be more worthwhile to ask for help with an improved script than to debug this one.
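One way to test that guess is to count how many lines fall under the threshold. The sketch below is only an illustration: `count_short_lines` and `sample_input.txt` are hypothetical names, and the default limit of 21 is taken from the script in the question.

```shell
# Hypothetical helper: print how many lines of a file are shorter than
# a limit (default 21, the threshold used in the question's script).
count_short_lines() {
    awk -v limit="${2:-21}" 'length($0) < limit { n++ } END { print n + 0 }' "$1"
}

# Try it on the sample input from the question.
printf '%s\n' '1,11,"dummy' '111",1111' \
    '2,22,"dummy 222",2222' '3,33,"dummy 333",3333' > sample_input.txt
count_short_lines sample_input.txt   # prints 2
```

If the same check prints 0 for the big file, the `length($0)<21` condition never matches there and every line is printed unchanged.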

Here's one more robust way to combine quoted fields that contain newlines using any awk:

$ awk -F'"' '{$0=prev $0; if (NF%2){print; prev=""} else prev=$0 OFS}' input_file.txt
1,11,"dummy 111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333

That may be a better starting point for you than your existing script. To do more than that, see "What's the most robust way to efficiently parse CSV using awk?".
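For readers who want to see why the one-liner works, here is the same logic spread over several lines with comments. The wrapper function `join_quoted_lines` and the file name `sample_input.txt` are hypothetical names added for illustration; the awk body is meant to behave the same as the command above.

```shell
# Hypothetical wrapper around the awk script from the answer.
join_quoted_lines() {
    awk -F'"' '
    {
        # Prepend any pending partial record. Assigning to $0 makes awk
        # re-split the record, so NF reflects the joined text.
        $0 = prev $0

        if (NF % 2) {
            # An odd number of "-separated fields means an even number
            # of double quotes, so every quoted field is closed.
            print
            prev = ""
        } else {
            # A quote is still open: keep the text plus OFS (a single
            # space by default) and wait for the next line.
            prev = $0 OFS
        }
    }' "$@"
}

# Try it on the sample input from the question.
printf '%s\n' '1,11,"dummy' '111",1111' \
    '2,22,"dummy 222",2222' '3,33,"dummy 333",3333' > sample_input.txt
join_quoted_lines sample_input.txt
```

Because the decision is based on quote parity rather than line length, it does not depend on a hard-coded limit such as 21. Note that if the input ends while a quote is still open, the pending text in `prev` is never printed.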

Ed Morton