
\s does not seem to work with

sed 's/[\s]\+//' tempfile

while this does work:

sed 's/[ ]\+//' tempfile

I am trying to remove the whitespace that appears at the beginning of each line as a result of this command:

nl -s ') ' file > tempfile  

e.g. file:

A Storm of Swords, George R. R. Martin, 1216
The Two Towers, J. R. R. Tolkien, 352
The Alchemist, Paulo Coelho, 197
The Fellowship of the Ring, J. R. R. Tolkien, 432
The Pilgrimage, Paulo Coelho, 288
A Game of Thrones, George R. R. Martin, 864

tempfile:

 1) Storm of Sword, George R. R. Martin, 1216
 2) The Two Tower, J. R. R. Tolkien, 352
 3) The Alchemit, Paulo Coelho, 197
 4) The Fellowhip of the Ring, J. R. R. Tolkien, 432
 5) The Pilgrimage, Paulo Coelho, 288
 6) A Game of Throne, George R. R. Martin, 864

i.e. there are spaces before the numbers.

Please explain why the whitespace appears and why \s does not work.

Thor
  • @Cyrus I am sure that you need to escape the + sign, but why does \s not work inside a list? –  Sep 03 '17 at 06:44
  • This might help: `sed -r 's/[[:space:]]+//' file` – Cyrus Sep 03 '17 at 06:46
  • @Cyrus I know how to make it work. I'd like to know why \s and things like it sometimes don't work. Also, I don't think your code will work without escaping the +. I am also not aware of [:space:]; please link some documentation for these things. I always use \s for whitespace, though I am on a Mac and had to get GNU sed for that. –  Sep 03 '17 at 06:49
  • sed uses the regex engine that is implemented in the standard library, and that is OS dependent. Some do not support \s at all. And \s in brackets never works with sed. Use `s/\s*//` or the suggestion of Cyrus. – yacc Sep 03 '17 at 07:09
  • What exactly do you mean by "sometimes"? Give some examples of working and some of not working and give details of environmental differences. – Yunnosch Sep 03 '17 at 08:46
  • Perhaps because the default width for `nl` is six characters wide? Try `sed 's/^\s\s*//' file`. – potong Sep 03 '17 at 10:22

2 Answers

The reason is simple: a POSIX regex engine does not parse Perl-like shorthand character classes as such inside bracket expressions.

See this reference:

One key syntactic difference is that the backslash is NOT a metacharacter in a POSIX bracket expression. So in POSIX, the regular expression [\d] matches a \ or a d.

So, [\s] in a POSIX regex matches one of two symbols: either \ or s.

Consider the following demo:

echo 'ab\sc' | sed 's/[\s]\+//'

The output is abc: the literal \s substring is removed.
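
To tie this back to the question's data, here is a minimal sketch assuming GNU sed (which the asker says they installed; `\+` is a GNU extension in basic regular expressions). The variable names are only for illustration. It suggests that the command deleted the first lowercase "s" on a line rather than the leading blanks, which appears to be what happened to the titles in the question's tempfile listing.

```shell
# Assumes GNU sed. printf is used instead of echo so that backslashes
# reach sed untouched regardless of the shell.

# [\s] is a bracket expression holding two literal characters: '\' and 's'.
literal=$(printf '%s\n' 'ab\sc' | sed 's/[\s]\+//')
printf '%s\n' "$literal"

# On a numbered line, the first run of '\' or 's' characters is the "s"
# at the end of "Towers"; the leading blanks are left alone.
mangled=$(printf '%s\n' '     2) The Two Towers, J. R. R. Tolkien, 352' | sed 's/[\s]\+//')
printf '%s\n' "$mangled"
```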

Consider using POSIX character classes instead of Perl-like shorthands:

echo 'ab\s c' | sed 's/[[:space:]]\+//'

See this online demo (the output is ab\sc). The POSIX character classes are made of [:<NAME_OF_CLASS>:], and they can only be used inside bracket expressions. See more examples of POSIX character classes here.

NOTE: if you want to make sure the spaces at the start of the line are removed, add ^ at the pattern start:

sed 's/^[[:space:]]\+//'
       ^ 
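
As a rough end-to-end sketch of the anchored form, the snippet below numbers a small file with nl and strips the padding again. The file name books.txt and the temporary directory are hypothetical stand-ins for the question's "file". It uses `*` instead of `\+` on the assumption that portability beyond GNU sed matters; since the pattern is anchored with ^, matching zero characters is harmless.

```shell
# Hypothetical sample data standing in for the question's "file".
tmpdir=$(mktemp -d)
printf '%s\n' 'The Alchemist, Paulo Coelho, 197' \
              'The Pilgrimage, Paulo Coelho, 288' > "$tmpdir/books.txt"

# ^ anchors the match at the start of the line, so only the padding that
# nl adds on the left is removed. [[:space:]]* is plain POSIX syntax.
stripped=$(nl -s ') ' "$tmpdir/books.txt" | sed 's/^[[:space:]]*//')
printf '%s\n' "$stripped"

rm -rf "$tmpdir"
```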

MORE PATTERNS:

  • \w = [[:alnum:]_]
  • \W = [^[:alnum:]_]
  • \d = [[:digit:]] (or [0-9])
  • \D = [^[:digit:]] (or [^0-9])
  • \h = [[:blank:]]
  • \S = [^[:space:]]
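
A few of these equivalents in use, as a small sketch with made-up variable names. It sticks to POSIX basic regex syntax (`xx*` instead of `\+`) so it should not depend on GNU extensions.

```shell
line='The Two Towers, J. R. R. Tolkien, 352'

# [[:digit:]] in place of \d: replace the trailing page count with N.
masked=$(printf '%s\n' "$line" | sed 's/[[:digit:]][[:digit:]]*$/N/')

# [^[:space:]] in place of \S: keep only the first whitespace-delimited word.
first=$(printf '%s\n' "$line" | sed 's/^\([^[:space:]]*\).*/\1/')

# [^[:alnum:]_] in place of \W: turn every non-word character into '-'.
slug=$(printf '%s\n' 'A Game of Thrones' | sed 's/[^[:alnum:]_]/-/g')

printf '%s\n' "$masked" "$first" "$slug"
```
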
Wiktor Stribiżew
  • A late-to-the-game comment, but [0-9] is not equivalent in all charsets to :digit: or \d. Just a tip for multilingual support. See the top voted answer here: https://stackoverflow.com/questions/6479423/does-d-in-regex-mean-a-digit – Steve Midgley Feb 04 '22 at 17:18

The leading spaces come from nl padding each line number to a fixed width (6 by default). You could instead format the numbers without that padding. From coreutils.info:

‘-w NUMBER’
‘--number-width=NUMBER’
     Use NUMBER characters for line numbers (default 6).

E.g.:

nl -w 1 -s ') ' infile

Output:

1) A Storm of Swords, George R. R. Martin, 1216
2) The Two Towers, J. R. R. Tolkien, 352
3) The Alchemist, Paulo Coelho, 197
4) The Fellowship of the Ring, J. R. R. Tolkien, 432
5) The Pilgrimage, Paulo Coelho, 288
6) A Game of Thrones, George R. R. Martin, 864
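
For comparison, a quick sketch of the default width next to `-w 1`, reading a single sample line from standard input (the variable names are illustrative only):

```shell
# Default: the number is right-aligned in a 6-character field, so a
# one-digit number is preceded by five spaces.
default=$(printf '%s\n' 'The Alchemist' | nl -s ') ')

# -w 1: the field is one character wide, so there is no padding.
narrow=$(printf '%s\n' 'The Alchemist' | nl -w 1 -s ') ')

printf '[%s]\n' "$default" "$narrow"
```
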
Thor