sed only replaces first leading white space match for only a particular file - dealing with CR-only line endings

Question

^{Editor's note:

The title was amended later with the benefit of hindsight; there were two distinct problems:

(a) it turned out that the input file had \r-only (CR-only) line endings (classic Mac OS-style)

(b) attempts to use \t and \r in sed regexes failed, because BSD Sed (as used on OSX) doesn't support such escapes.}

I'm working on an Automator program that uses Python to find-and-replace certain words in a text file. The program uses a dictionary, and there are instances in which the value used as a replacement is '' (meaning, nothing). I don't think that the program is causing this issue, but I just mention this by way of context. (The problem, I think, lies with sed, so I was reluctant to tag Python.)

Some lines in the file have leading white space that is created inadvertently when certain words at the beginning of a line are replaced by nothing. I want to get rid of it, and I think sed is the best tool for the job in this case.

Let's say this is what the text file looks like:

  Display
  Display
 BOX,

So I'm running the edited file through sed using this:

sed -e 's/^[ \t]*//g'

This is the result:

 Display
  Display
 BOX,

Only the first match is edited. Why?

By way of a test, I created a brand new plain text file like this:

 hello
 hello
 hello

Then I ran the command above on it. That actually worked as expected. Why?

Is it possible that there is some other form of space being used (a non-printable character?) that was created by the Python program? But then why would sed work at least once?

By the way, I am open to another portable solution or tool compatible with OS X for trimming leading white space from every line in a plain text file.

Edit: Here is some of the xxd output of the file (replaced most actual content with X):

0000000: 2044 6973 706c 6179 2043 616c 6962 7261   X X
0000010: 7469 6f6e 2046 6978 7475 7265 2046 4952  X X X
0000020: 4d57 4152 4520 4b49 545e 4d20 4469 7370  X X^M X
0000030: 6c61 7920 4361 6c69 6272 6174 696f 6e20  X X 
0000040: 4669 7874 7572 6520 524d 6163 426f 6f6b  X X
0000050: 2041 6972 2028 3131 2d69 6e63 682c 204d   X X
0000060: 6964 2032 3031 3229 2050 4f52 5420 4b49  X X) X X
0000070: 545e 4d42 4f58 2c20 5245 434f 5645 5259  T^MBOX, X

`cat -v` shows `^M ` at the beginning of lines with spaces, and `^M` at the beginning of every other line except the very first one. Every line begins with `^M` because it is the carriage return character, I believe. The spaces appear to be just regular spaces. — dispepsi, Feb 10 '16 at 23:29
They're not regular spaces. If you pipe that file to `xxd`, you can see what they actually are. — Benjamin W., Feb 10 '16 at 23:36
Okay, here is a slice of the `xxd` output of a line that begins with a space: `0000020: 4d57 4152 4520 4b49 545e 4d20 4469 7370 XXXXX XXX^M Disp` I am not familiar with `xxd` so I don't know how to interpret this. — dispepsi, Feb 10 '16 at 23:42
Here is a bit more (I used X where I edited out the actual content of my file): `0000000: 2044 6973 706c 6179 2043 616c 6962 7261 X X 0000010: 7469 6f6e 2046 6978 7475 7265 2046 4952 X X X 0000020: 4d57 4152 4520 4b49 545e 4d20 4469 7370 X XX^M X 0000030: 6c61 7920 4361 6c69 6272 6174 696f 6e20 XX XX 0000040: 4669 7874 7572 6520 524d 6163 426f 6f6b XX XX 0000050: 2041 6972 2028 3131 2d69 6e63 682c 204d X (XXX 0000060: 6964 2032 3031 3229 2050 4f52 5420 4b49 XXXXX XX` — dispepsi, Feb 10 '16 at 23:47
@celestialroad `^M` means carriage return. Try adding `\r` to your regex: `s/^[ \t\r]*//g` — Andreas Louv, Feb 10 '16 at 23:55
Adding `\r` to the regex as described doesn't resolve the issue. — dispepsi, Feb 10 '16 at 23:58
Did you pipe the output of `cat -v` to `xxd`? It shows the literal `^M`. You have to pipe the unmodified file to make sense of the `xxd` output. — Benjamin W., Feb 11 '16 at 00:05
`sed` is a line oriented stream editor. The whole text is treated by `sed` as a single line. This means that the regex `^` start of line only applies to the first word `Display` because the rest of the file is a continuation. You might want to convert those carriage returns to newlines. — alvits, Feb 11 '16 at 01:52
I like posting here because I tend to inadvertently learn things I didn't expect to. :) — dispepsi, Feb 11 '16 at 17:41

score 2 · Accepted Answer · edited May 23 '17 at 12:23

tl;dr

^{None of the solutions below update the input file in place; the stand-alone sed commands could be adapted with -i '' to do that; the awk solutions require saving to a different file first.}

The OP's input appears to be a file with classic Mac OS \r-only line breaks.^{Thanks, @alvits.}
sed invariably reads such a file as a single line, which is typically undesired and gets in the way of the OP's approach of trimming leading whitespace from each line.
awk is therefore the better choice, because it allows specifying what constitutes a line break (via the so-called input record separator):

Update: Replaced the original awk command with a simpler and faster alternative, adapted from peak's solution:

awk -v RS='\r' '{ sub(/^[ \t]+/, ""); print }'

If it's acceptable to also trim trailing whitespace, if any, from each line and to normalize whitespace between words on a line to a single space each, you can simplify to:

awk -v RS='\r' '{ $1=$1; print }'

Note that the output lines will be \n-separated, as is typically desired. For an explanation and background information, including how to preserve \r as line breaks, read on.
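Since the awk commands above write to stdout rather than updating the file, here is a minimal sketch of the "save to a different file first" approach. It assumes `mktemp` is available; the function name `trim_cr_file` is made up for illustration.

```shell
# Hypothetical helper: trim leading blanks from a CR-terminated file "in place"
# by writing awk's output to a temporary file first.
trim_cr_file() {
  tmp=$(mktemp) || return 1
  # Records are split on \r; the output lines are \n-terminated.
  if awk -v RS='\r' '{ sub(/^[ \t]+/, ""); print }' "$1" > "$tmp"; then
    cat "$tmp" > "$1"   # overwrite the contents, keeping the original file's permissions
  fi
  rm -f "$tmp"
}
```

Invoked as `trim_cr_file file.txt`, this rewrites `file.txt` with leading whitespace removed and \n line endings. Copying the contents back with `cat` rather than `mv` is a design choice that preserves the original file's permissions and ownership.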

^{Note: The first part of the answer applies generally, but assumes that the input has \n-terminated lines; the OP's special case, where lines are apparently \r-only-terminated, is handled in the 2nd part.}

BSD Sed, as used on OSX, only supports \n as a control-character escape sequence; thus, \t for matching tab chars. is not supported.

To still match tabs, you can splice an ANSI C-quoted string yielding an actual tab char. into your Sed script ($'\t'):

sed 's/^[ '$'\t'']*//'

^{In this simple case you could use an ANSI C-quoted string for the entire Sed script (sed -e $'s/^[ \t]*//'), but this can get tricky with more complex scripts, because such strings have their own escaping rules.}

Note that option g was removed, because it is pointless, given that the regex is anchored to the start of the input (^).
For a summary of the differences between GNU and BSD Sed, see this answer of mine.
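If the script has to run in a shell that may lack `$'...'` support, the literal tab can instead be captured in a variable. This is only a sketch; the names `tab` and `trim_leading_lf` are arbitrary, and the input is assumed to be \n-terminated.

```shell
# Portable alternative to $'\t': let printf produce the literal tab character.
tab=$(printf '\t')

# Hypothetical wrapper: trim leading spaces and tabs from \n-terminated lines.
# Works with both BSD and GNU sed, because sed sees an actual tab, not an escape.
trim_leading_lf() {
  sed "s/^[ ${tab}]*//" "$@"
}

printf '  \tone\n\t two\n' | trim_leading_lf
```

The sample call prints `one` and `two`, each on its own line, with no leading whitespace.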

As @alvits points out in a comment, the input file may actually have \r instances instead of the \n instances that Sed requires to separate lines.

^{I.e., the file may have Pre-OSX Mac OS line terminators: an \r by itself terminates a line.}

An easy way to verify that is to pass the input file to cat -et: \r instances are visualized as ^M, whereas \n instances are visualized as $ (additionally, \t instances are visualized as ^I).

If only ^M instances, but no $ instances are in the output, the implication is that lines aren't terminated with \n (also), and the entire input file is treated as a single string, which explains why only the first input "line" was processed: the ^ only matched at the very beginning of the entire string.
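As a complement to eyeballing the `cat -et` output, the same check can be scripted by counting \r and \n bytes. A rough sketch; the function name `line_endings` and its labels are invented here, and it does not distinguish proper CRLF pairs from mixed endings.

```shell
# Hypothetical helper: report a file's line-ending style by counting CR and LF bytes.
line_endings() {
  # tr -cd deletes everything except the given byte; wc -c counts what is left.
  # The final tr strips the padding that BSD wc adds around the number.
  cr=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')
  lf=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')
  if [ "$cr" -gt 0 ] && [ "$lf" -eq 0 ]; then
    echo 'CR-only'
  elif [ "$cr" -gt 0 ]; then
    echo 'CRLF or mixed'
  elif [ "$lf" -gt 0 ]; then
    echo 'LF-only'
  else
    echo 'no line breaks'
  fi
}
```

For a file like the OP's, `line_endings file.txt` would be expected to report `CR-only`.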

Since a Sed solution (without preprocessing) causes the entire file to be read as a whole, awk is the better choice:

To create \n-separated output, as is customary on Unix-like platforms:

awk -v RS='\r' '{ sub(/^[ \t]+/, ""); print }'

-v RS='\r' tells Awk to split the input into records by \r instances (special variable RS contains the input record separator).
sub(/^[ \t]+/, "") searches for the first occurrence of regex ^[ \t]+ on the input line and replaces it with "", i.e., it effectively trims a leading run of spaces and tabs from each input line. Note that sub() without an explicit 3rd argument implicitly operates on $0, the whole input line.
print then prints the potentially modified input line.
By virtue of \n being Awk's default output record separator (ORS), the output records will be \n-terminated.
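To see the command at work without the original file, the OP's sample can be simulated with `printf`. The wrapper name `trim_leading` is made up for this sketch.

```shell
# Hypothetical wrapper around the awk command explained above.
# "$@" passes file operands through; with none, awk reads stdin.
trim_leading() {
  awk -v RS='\r' '{ sub(/^[ \t]+/, ""); print }' "$@"
}

# Simulated CR-only input resembling the OP's sample.
printf '  Display\r  Display\r BOX,\r' | trim_leading
```

This prints `Display`, `Display` and `BOX,` on three \n-terminated lines, each without leading whitespace.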

If you really want to retain \r as the line separator:

awk 'BEGIN { RS=ORS="\r" } { sub(/^[ \t]+/, ""); print }'

RS=ORS="\r" sets both the input and the output record separator to \r.
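Because \r-terminated output is awkward to inspect in a terminal, a quick check can translate the separators for display only. A sketch; `trim_leading_keep_cr` is an invented name.

```shell
# Hypothetical wrapper: trim leading blanks while keeping \r as the line separator.
trim_leading_keep_cr() {
  awk 'BEGIN { RS=ORS="\r" } { sub(/^[ \t]+/, ""); print }' "$@"
}

# The trailing tr is for viewing only: it makes each \r visible as a line break.
printf '  one\r two\r' | trim_leading_keep_cr | tr '\r' '\n'
```

Without the final `tr`, the output would consist of `one` and `two`, each followed by a \r.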

If it's acceptable to also trim trailing whitespace, if any, from each line and to normalize whitespace between words on a line to a single space each, you can simplify the \n-terminated variant to:

awk -v RS='\r' '{ $1=$1; print }'

Not using -F (and not setting FS, the input field separator, in the script either) means that Awk splits the input record into fields by runs of whitespace (spaces, tabs, newlines).
$1=$1 is a dummy assignment whose purpose is to trigger rebuilding of the input line, which happens whenever a field variable is assigned to.
The line is rebuilt by joining the fields with OFS, the output-field separator, which defaults to a single space.
In effect, leading and trailing whitespace is thereby trimmed, and each run of line-interior whitespace is normalized to a single space.
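A small demonstration of the rebuilding effect on made-up input (the wrapper name `normalize_ws` is hypothetical):

```shell
# Hypothetical wrapper: split records on \r and rebuild each line from its fields,
# which trims both ends and squeezes interior whitespace runs to single spaces.
normalize_ws() {
  awk -v RS='\r' '{ $1=$1; print }' "$@"
}

printf '  Display   Calibration \r\tBOX,  RECOVERY\r' | normalize_ws
```

This prints `Display Calibration` and `BOX, RECOVERY`, each on its own line. Note that the interior whitespace is changed too, so this variant only fits when exact spacing within a line doesn't matter.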

If you do want to stick with sed¹ - even if that means reading the whole file at once:

sed $'s/^[ \t]*//; s/\r[ \t]*/\\\n/g' # note the $'...' to make \t, \r, \n work

This will output \n-terminated lines, as is customary on Unix.

If, by contrast, you want to retain \r as the line separators, use the following - but note that BSD Sed will invariably add a \n at the very end.

 sed $'s/^[ \t]*//; s/\r[ \t]*/\r/g'

^{[1] peak's answer originally showed a pragmatic multi-utility alternative more clearly: replace all \r instances with \n instances using tr, and pipe the result to the BSD-Sed-friendly version of the original sed command:

tr '\r' '\n' < file | sed $'s/^[ \t]*//'}

The carriage returns are not treated as newlines. Any spaces or tabs after the first word are not at the beginning of a line, hence they fail to match at the anchor `^`. — alvits, Feb 11 '16 at 01:55
@alvits: Good point, thanks - I was focusing on the attempt to use escape sequence `\t`, which won't work in BSD Sed (and neither does `\r`). I've updated my answer based on your pointer. — mklement0, Feb 11 '16 at 02:07
Thanks very much for your comprehensive answer. Your first `awk` solution solved the issue very well for my purposes. — dispepsi, Feb 11 '16 at 17:38

peak · Answer 2 · 2016-02-11T21:50:07.160

2

If (as seems to be the case) the input file uses \r as the "end-of-line" character, then whatever else is done, it would probably make sense to convert the '\r' to '\n' or CRLF, depending on the platform. Assuming that '\n' is acceptable, and if there is any point in saving the original file with the CR replaced by LF, you could use tr:

tr '\r' '\n' < INFILE > OUTFILE

With a bash-like shell, you could then invoke sed like so:

sed -e $'s/^[ \t]*//' OUTFILE

The tr and sed commands could of course be strung together (tr ... | sed ...) but that incurs the overhead of a pipeline.
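For completeness, a sketch of what the strung-together form might look like. It reads from stdin, uses a `printf`-generated tab instead of `$'...'` so that it also works outside bash-like shells, and the function name `cr_to_lf_trim` is made up.

```shell
# Literal tab character for the bracket expression (BSD sed does not understand \t).
tab=$(printf '\t')

# Hypothetical pipeline wrapper: convert CR to LF, then trim leading spaces and tabs.
cr_to_lf_trim() {
  tr '\r' '\n' | sed "s/^[ ${tab}]*//"
}

printf '  Display\r  Display\r BOX,\r' | cr_to_lf_trim
```

The sample call prints `Display`, `Display` and `BOX,` on separate \n-terminated lines.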

If you have no interest in saving the original file with the CR replaced by LF, then you may wish to consider the following one-stop awk variation:

awk -v RS='[\r]' '{s=$0; sub(/^[ \t]*/,"",s); print s}'

This variation is both fast and safe as no parsing into fields is involved.

(As pointed out elsewhere, one advantage of using awk is that ORS can be used to set the output-record-separator if the default setting is unsatisfactory.)

edited Feb 11 '16 at 21:50

answered Feb 11 '16 at 04:32


++; while not as efficient as a single-utility `awk` solution, this is certainly the simplest if you want to stick with `sed`. As in the OP's own approach, the `g` is pointless, since the regex is anchored to the start of the line. – mklement0 Feb 11 '16 at 15:42
@mklement0 - The g was just a relic of the OP's sed. Gone. As for "efficiency" it may depend on criteria and other factors. Have you measured? – peak Feb 11 '16 at 16:15
Good point - the difference may not matter in the real world, especially in terms of _speed_; in terms of _resources_, assuming the utilities involved use comparable amounts of memory: one process is better than two, and not using a pipe (FIFO) is better than using one. Again, it may not matter in practice, but I'm curious now: do let me know what you find. – mklement0 Feb 11 '16 at 16:28
While I chose @mklement0's answer because the solution fit well with my purposes, this was also a good and simple solution. By the way, a quick test revealed that both of these options were just as quick (@mklement0's solution was faster by 0.0001 second for what it's worth). – dispepsi Feb 11 '16 at 17:40
Thank you for your thoughtful feedback, @celestialroad - especially the relative performance comparison; we can conclude that the two solutions perform the same, time-wise. – mklement0 Feb 11 '16 at 20:18
@mklement0 - The one-stop awk solution can be significantly faster than the pipeline, so I've revised this entry accordingly. Thanks! – peak Feb 11 '16 at 21:52
Kudos for the revised `awk` command - not referencing the field variables (`$1`, ...) indeed seems to be faster - never knew about this optimization. Perhaps you chose to be explicit, but just to note the more concise alternative: `awk -v RS='[\r]' '{ sub(/^[ \t]*/, ""); print }'` is sufficient, given that `sub()` operates on `$0` by default. It sounds like you think that `sed`'s `-e` option performs in-place updating, but it doesn't; that's what `-i ''` is for (note that the `''` argument is _mandatory_ with BSD Sed, unlike with GNU Sed). – mklement0 Feb 12 '16 at 04:22
@mklement0 - Yes, in SO answers especially, I think it makes sense to minimize the clevernesses. I'm not sure why you had the impression that I was mixing up -e with -i but rest assured, I wasn't :-) – peak Feb 12 '16 at 04:27
Re `-e` vs. `-i`: The following sentence confused me (emphasis mine), `If you have no interest in saving the *original* file with the CR replaced by LF`, but I now see that you were referring to the _copy_ of the original that your `tr` command created. Given that it's far likelier that the _result_ of the trimming will be saved to a file (not a line-endings-only-modification of the _input_), your original pipeline strikes me as the better approach to the `sed` solution. – mklement0 Feb 12 '16 at 04:49
