The 'sed' approach:
sed ':a;N;$!ba;s/\n|/|/g' input.txt
That said, awk would be faster and easier to understand and maintain; I just had that example handy (it's a common solution for replacing newlines with sed).
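For comparison, here is a rough awk sketch of the same line-joining (my sketch, not benchmarked; it buffers only one logical line at a time instead of the whole file):

awk '/^\|/ { buf = buf $0; next }   # line starts with "|": glue it onto the buffered line
     NR > 1 { print buf }           # otherwise flush the previously buffered line
     { buf = $0 }                   # and start buffering the current line
     END { if (NR) print buf }' input.txt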
EDIT:
To clarify the difference between this answer (option #1) and the alternative solution by @potong, which I actually prefer and will call option #2:

sed ':a;N;s/\n|/|/;ta;P;D' file

- Note that these are just two of many possible options with sed. I actually prefer non-sed solutions, since they generally run faster. But these two are notable because they demonstrate two distinct ways to process a file: option #1 all in-memory, and option #2 as a stream. (Note: below, when I say "buffer", technically I mean the "pattern space".) Annotated versions of both scripts follow this list.
- Option #1 reads the whole file into memory: :a is just a label; N appends the next line to the buffer; if end-of-file ($) is not (!) reached, then branch (b) back to label :a ...
- Then, once the whole file has been read into memory, process the buffer with the substitution command (s), replacing all occurrences of "\n|" (a newline followed by "|") with just "|", over the entire (g) buffer.
- Option #2 processes just a couple of lines at a time: it reads/appends the next line (N) into the buffer and processes it (s/\n|/|/); it branches (t) back to label :a only if the substitution was successful; otherwise it prints (P) and then clears/deletes (D) the current buffer up to the first embedded newline ... and the stream continues.
- Option #1 takes a lot more memory to run: in general, about as much as the size of your file. Option #2 requires minimal memory, so little that I didn't bother to check what it correlates to (I'm guessing roughly the length of a line).
- Option #1 runs faster: in general, about twice as fast as option #2, but obviously it depends on the file and on what is being done.
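For reference, here are the same two scripts in expanded form, with each step spelled out in comments (GNU sed; comment lines in a sed script start with #). Nothing here changes their behavior.

Option #1:

sed '
  # label for the read loop
  :a
  # append the next input line to the pattern space
  N
  # if this is not ($!) the last line, branch (b) back to :a
  $!ba
  # the whole file is now in the pattern space: replace every
  # newline followed by "|" with just "|", globally (g)
  s/\n|/|/g
' file

Option #2:

sed '
  :a
  # append the next input line to the pattern space
  N
  # try to join: drop the newline if the appended line starts with "|"
  s/\n|/|/
  # if that substitution succeeded, branch (t) back to :a and keep joining
  ta
  # otherwise print (P) the pattern space up to the first embedded newline...
  P
  # ...and delete (D) that part, restarting the cycle with whatever remains
  D
' file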
On a ~500MB file, option #1 runs about twice as fast (1.5s vs 3.4s):
$ du -h /tmp/foobar.txt
544M /tmp/foobar.txt
$ time sed ':a;N;$!ba;s/\n|/|/g' /tmp/foobar.txt > /dev/null
real 0m1.564s
user 0m1.390s
sys 0m0.171s
$ time sed ':a;N;s/\n|/|/;ta;P;D' /tmp/foobar.txt > /dev/null
real 0m3.418s
user 0m3.239s
sys 0m0.163s
At the same time, option #1 takes about 500MB of memory, and option #2 requires less than 1MB:
$ ps -F -C sed
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
username 4197 11001 99 172427 558888 1 19:22 pts/10 00:00:01 sed :a;N;$!ba;s/\n|/|/g /tmp/foobar.txt
note: /proc/{pid}/smaps (Pss): 558188 (545M)
And option #2:
$ ps -F -C sed
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
username 4401 11001 99 3468 864 3 19:22 pts/10 00:00:03 sed :a;N;s/\n|/|/;ta;P;D /tmp/foobar.txt
note: /proc/{pid}/smaps (Pss): 236 (0M)
In summary (with some commentary):
- if you have files of unknown size, streaming without buffering is a better decision.
- if every second matters, then buffering the entire file and processing it at once may be fine -- but ymmv.
- My personal experience with tuning shell scripts is that awk or perl (or tr, though it's the least portable) or even bash may be preferable to using sed.
- Yet sed is a very flexible and powerful tool that gets a job done quickly, and can be tuned later.
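As an example of the non-sed route, a minimal perl sketch of the same substitution (not benchmarked here; it mirrors option #1, since -0777 slurps the whole file into memory as a single record, so expect a similar memory profile):

perl -0777 -pe 's/\n\|/|/g' input.txt

The awk sketch near the top of the answer is the streaming counterpart.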