tl;dr
None of the solutions below update the input file in place; the stand-alone sed commands could be adapted with -i '' to do that; the awk solutions require saving to a different file first.
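As a minimal sketch of the "save to a different file first" approach for awk: the file path below is a throwaway created purely for illustration, and the temporary-file-then-mv pattern is one common way to do this, not something the original answer prescribes.

```shell
# Hypothetical sample file with \r-only line breaks, created for this demo only.
dir=$(mktemp -d)
file="$dir/demo.txt"
printf '  one\r\ttwo\r' > "$file"

# Write the trimmed result to a temporary file first, then replace the
# original only if awk succeeded.
tmp=$(mktemp) &&
  awk -v RS='\r' '{ sub(/^[ \t]+/, ""); print }' "$file" > "$tmp" &&
  mv "$tmp" "$file"

cat "$file"
```

Note that mv replaces the file, so the original's permissions are not carried over; writing back with cat "$tmp" > "$file" would keep them.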
- The OP's input appears to be a file with classic Mac OS \r-only line breaks (thanks, @alvits).
- sed invariably reads such a file as a whole, which is typically undesired and gets in the way of the OP's approach of trimming line-leading whitespace.
- awk is therefore the better choice, because it allows specifying what constitutes a line break (via the so-called input record separator):
Update: Replaced the original awk command with a simpler and faster alternative, adapted from peak's solution:
awk -v RS='\r' '{ sub(/^[ \t]+/, ""); print }'
If it's acceptable to also trim trailing whitespace, if any, from each line and to normalize whitespace between words on a line to a single space each, you can simplify to:
awk -v RS='\r' '{ $1=$1; print }'
Note that the output lines will be \n-separated, as is typically desired.
For an explanation and background information, including how to preserve \r as line breaks, read on.
Note: The first part of the answer applies generally, but assumes that the input has \n-terminated lines; the OP's special case, where lines are apparently terminated by \r only, is handled in the 2nd part.
BSD Sed, as used on OSX, only supports \n as a control-character escape sequence; thus, \t for matching tab chars. is not supported.
To still match tabs, you can splice an ANSI C-quoted string ($'\t'), which yields an actual tab char., into your Sed script:
sed 's/^[ '$'\t'']*//'
In this simple case you could use an ANSI C-quoted string for the entire Sed script (sed -e $'s/^[ \t]*//'), but this can get tricky with more complex scripts, because such strings have their own escaping rules.
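If the shell at hand lacks ANSI C quoting (a plain POSIX sh may), one alternative that is not part of the original answer is to capture a literal tab in a variable and let the shell expand it inside a double-quoted Sed script. The variable names and the sample input are made up for illustration.

```shell
# Capture a literal tab character; command substitution only strips
# trailing newlines, so the tab survives.
tab=$(printf '\t')

# Double quotes let the shell expand $tab inside the bracket expression,
# so sed sees an actual tab rather than the unsupported \t escape.
trimmed=$(printf '\t  hello\tworld\n' | sed "s/^[ $tab]*//")
printf '%s\n' "$trimmed"
```

Only the leading run of spaces and tabs is removed; the tab between the two words is left alone.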
- Note that option g was removed, because it is pointless, given that the regex is anchored to the start of the input (^).
- For a summary of the differences between GNU and BSD Sed, see this answer of mine.
As @alvits points out in a comment, the input file may actually have \r instances instead of the \n instances that Sed requires to separate lines.
I.e., the file may have pre-OSX Mac OS line terminators: a \r by itself terminates a line.
An easy way to verify that is to pass the input file to cat -et: \r instances are visualized as ^M, whereas \n instances are visualized as $ (additionally, \t instances are visualized as ^I).
If the output contains only ^M instances, but no $ instances, the implication is that lines aren't terminated with \n (also), and the entire input file is treated as a single string, which explains why only the first input "line" was processed: the ^ only matched at the very beginning of the entire string.
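The check can be tried on throwaway data. The sketch below builds a small \r-only sample (file name and contents are invented for the demo), shows what cat -et makes of it, and, as an additional assumption-level convenience, counts \r and \n bytes with tr and wc.

```shell
# Hypothetical sample: two "lines" terminated by \r only, one starting with a tab.
sample=$(mktemp)
printf '  one\r\ttwo\r' > "$sample"

# Visualize control characters: ^M marks \r, ^I marks tab; a $ would mark \n.
visual=$(cat -et < "$sample")
printf '%s\n' "$visual"

# Count the two kinds of line-break bytes (tr -d ' ' strips wc's padding).
cr_count=$(tr -cd '\r' < "$sample" | wc -c | tr -d ' ')
lf_count=$(tr -cd '\n' < "$sample" | wc -c | tr -d ' ')
printf 'CR: %s, LF: %s\n' "$cr_count" "$lf_count"
```

A nonzero CR count together with a zero LF count points to classic Mac OS line breaks.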
Since a Sed solution (without preprocessing) causes the entire file to be read as a whole, awk is the better choice:
To create \n-separated output, as is customary on Unix-like platforms:
awk -v RS='\r' '{ sub(/^[ \t]+/, ""); print }'
-v RS='\r' tells Awk to split the input into records by \r instances (special variable RS contains the input record separator).
sub(/^[ \t]+/, "") searches for the first occurrence of regex ^[ \t]+ on the input line and replaces it with "", i.e., it effectively trims a leading run of spaces and tabs from each input line. Note that sub() without an explicit 3rd argument implicitly operates on $0, the whole input line.
print then prints the potentially modified input line.
By virtue of \n being Awk's default output record separator (ORS), the output records will be \n-terminated.
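To see the command at work without a real input file, it can be fed a made-up \r-separated sample via printf; the sample text and the variable name are assumptions for illustration only.

```shell
# Three records separated by \r, with leading spaces and/or tabs.
out=$(printf '  one\r\ttwo\r \t three\r' |
  awk -v RS='\r' '{ sub(/^[ \t]+/, ""); print }')

# Each record comes out on its own \n-terminated line, leading whitespace removed.
printf '%s\n' "$out"
```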
If you really want to retain \r as the line separator:
awk 'BEGIN { RS=ORS="\r" } { sub(/^[ \t]+/, ""); print }'
RS=ORS="\r" sets both the input and the output record separator to \r.
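Because \r does not show up well in a terminal, a quick way to confirm that the separators were retained is to dump the output bytes with od. This is a sketch on invented sample data; piping through tr merely compacts od's column layout.

```shell
# Run the \r-preserving variant on a made-up sample and dump the result
# as escaped characters, with od's spacing and line breaks removed.
bytes=$(printf '  one\r\ttwo\r' |
  awk 'BEGIN { RS=ORS="\r" } { sub(/^[ \t]+/, ""); print }' |
  od -An -c | tr -d ' \n')

# Leading whitespace is gone, and each record still ends in \r (no \n anywhere).
printf '%s\n' "$bytes"
```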
If it's acceptable to also trim trailing whitespace, if any, from each line and to normalize whitespace between words on a line to a single space each, you can simplify the \n-terminated variant to:
awk -v RS='\r' '{ $1=$1; print }'
Not using -F (and not setting FS, the input field separator, in the script either) means that Awk splits the input record into fields by runs of whitespace (spaces, tabs, newlines).
$1=$1 is a dummy assignment whose purpose is to trigger rebuilding of the input line, which happens whenever a field variable is assigned to.
The line is rebuilt by joining the fields with OFS, the output field separator, which defaults to a single space.
In effect, leading and trailing whitespace is thereby trimmed, and each run of line-interior whitespace is normalized to a single space.
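The rebuilding effect is easiest to see on input with messy spacing. The sample below is invented for the purpose and mixes leading, trailing, and interior runs of spaces and tabs.

```shell
# Two \r-separated records with irregular whitespace.
normalized=$(printf '  foo \t bar  \r\tbaz   qux\r' |
  awk -v RS='\r' '{ $1=$1; print }')

# Assigning to $1 makes awk rejoin the fields with OFS (a single space).
printf '%s\n' "$normalized"
```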
If you do want to stick with sed [1], even if that means reading the whole file at once:
sed $'s/^[ \t]*//; s/\r[ \t]*/\\\n/g' # note the $'...' to make \t, \r, \n work
This will output \n-terminated lines, as is customary on Unix.
If, by contrast, you want to retain \r as the line separator, use the following, but note that BSD Sed will invariably add a \n at the very end.
sed $'s/^[ \t]*//; s/\r[ \t]*/\r/g'
[1] peak's answer originally showed a pragmatic multi-utility alternative more clearly: replace all \r instances with \n instances using tr, and pipe the result to the BSD-Sed-friendly version of the original sed command:
tr '\r' '\n' < file | sed $'s/^[ \t]*//'