I'm just curious why the output of the sed command in these two environments is different:

  1. command:

echo "xxx-MNP_ISS_DE-5.12.0.37-quality.zip"|sed 's#^[a-z,A-Z,-.\_]*##'

5.12.0.37-quality.zip

System info: i)echo $0

bash

ii)uname -a

Linux xxxx 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  2. command:

echo "xxx-MNP_ISS_DE-5.12.0.37-quality.zip"|sed 's#^[a-z,A-Z,-.\_]*##'

-MNP_ISS_DE-5.12.0.37-quality.zip

System info: i)echo $0

-bash

ii)uname -a

Linux xxxx 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Thanks in advance!

User123
  • I don't see how this could be related to the kernel. I would check for differences in the `sed` version. BTW, I don't know what you want to achieve exactly, but using `sed` for this task is probably overkill anyway. – user1934428 Sep 23 '21 at 06:24
  • Check the `locale` in the two environments. And I think you're not using the right syntax for the bracket expression; in a bracket expression, you don't use commas to separate things, it's just another character, and dashes indicate a range unless they occur in a place where they can't (like as the first char) (so `[,-.]` indicates a range of characters starting with `,` and ending with `.`). I think you want `[-a-zA-Z.\_]` or maybe `[-[:alpha:].\_]` – Gordon Davisson Sep 23 '21 at 06:25
  • You have a strange idea of what a kernel does if you think it could have an effect like this. – Barmar Sep 23 '21 at 06:31
  • How old is the machine with `Linux xxxx 3.10` on it, and what version of `bash` is it running? The 3.10 kernel was released in June 2013? – David C. Rankin Sep 23 '21 at 06:32
  • @user1934428 and @Barmar: I'm a bit naïve about the core concepts related to the kernel; it was just wrong speculation on my part, so please forgive me for that :-) But at least this isn't due to a difference in `sed` version, as I checked: both have `sed (GNU sed) 4.2.2` – User123 Sep 23 '21 at 06:39
  • As @GordonDavisson pointed out, I believe it could be due to a difference in `locale` settings. I found the settings to be `LC_CTYPE="POSIX"` and `LC_CTYPE="en_US.UTF-8"` respectively on the two systems. It could also be due to the 'wrong' bracket expression syntax I had been using, with `,` in between: `[a-z,A-Z,-.\_]`. With this syntax, `[-a-zA-Z.\_]`, I now see the same output in both environments. Thanks :-) – User123 Sep 23 '21 at 06:44
  • @DavidC.Rankin: the `bash` version is : `GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)` – User123 Sep 23 '21 at 06:47
  • @DavidC.Rankin Why would the version of `bash` affect what `sed` does? – Barmar Sep 23 '21 at 07:01
  • It wouldn't. I was curious how old the version was. It clicked later that Red Hat uses old kernels (patched for stability and security, etc.). – David C. Rankin Sep 23 '21 at 07:31
  • @GordonDavisson: While the locale does affect the character class, I don't see how the difference in the locale could affect the outcome with this particular input. Also, the commas are weird, but this too should not matter in the concrete example. – user1934428 Sep 23 '21 at 07:38
  • @User123 : I tried to repeat your case with the two different locale settings you mentioned, and I got the same result (_5.12.0.37-quality.zip_) in both cases. Is the case where you get _-MNP_ISS_DE-5.12.0.37-quality.zip_ reproducible? What happens if you prefix the line with the word _command_, i.e. `command echo "xxx-MNP_ISS_DE-5.12.0.37-quality.zip"|sed 's#^[a-z,A-Z,-.\_]*##'`? – user1934428 Sep 23 '21 at 07:42
  • @user1934428: with your given command, i'm getting the output as : `-MNP_ISS_DE-5.12.0.37-quality.zip` where the `locale` setting is : `LC_CTYPE="en_US.UTF-8"` and the output as : `5.12.0.37-quality.zip` where the `locale` setting is : `LC_CTYPE="POSIX"` – User123 Sep 23 '21 at 07:46
  • @User123 : From this, I would conclude that on the "ill-behaving" system, the third _x_ of the initial _xxx_ is something different than we think it is. Could you for a test replace the `sed` command in both cases by a simple `xxd`, i.e. `echo .... | xxd`? – user1934428 Sep 23 '21 at 08:08

1 Answer

Short answer: the character range is malformed, and it runs into what appears to be a bug in GNU sed v4.2.2's Unicode character range handling. Use `[-[:alpha:].\_]` instead (assuming the backslash is actually supposed to be one of the characters to trim; if not, remove it from the bracket expression).

Long answer, part 1: In a regex bracket expression (like `[some characters]`), commas are not needed to separate entries, and will instead be treated as part of the list of characters to match. On the other hand, dashes in most contexts are treated as part of a range (e.g. `a-z`) rather than as literal characters themselves. Thus, the bracket expression `[a-z,A-Z,-.\_]` is parsed as including:

  • The character range `a` through `z`
  • The character `,`
  • The character range `A` through `Z`
  • The character range `,` through `.`
  • The characters `\` and `_`

This isn't quite what's intended, but it is mostly close enough to what's expected. Mostly. Except for the `,` through `.` range.
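The parse described above can be checked in the POSIX locale, where ranges follow plain ASCII order. This is only a sketch: the `probe` string and the variable names are made up for illustration, and `LC_ALL=C` is forced so the result should not depend on the system locale.

```shell
# Probe which characters the original bracket expression matches.
# LC_ALL=C forces ASCII ordering for the ranges inside [...].
probe='!+,-./09:AZ\_az'
matched=$(printf '%s\n' "$probe" | LC_ALL=C sed 's#[a-z,A-Z,-.\_]#X#g')

printf '%s\n' "$probe"     # !+,-./09:AZ\_az
printf '%s\n' "$matched"   # !+XXX/09:XXXXXX
```

Every character that became `X` is in the set: the letters, the three characters `,`, `-` and `.` from the accidental range, plus `\` and `_`. Digits and the other punctuation are left alone.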

Long answer, part 2: The character `,` is hex 2C (decimal 44) in ASCII, and U+002C in Unicode. The `.` is hex 2E (46) in ASCII and U+002E in Unicode. In both ASCII and Unicode, the character between them happens to be `-`. This means that if character ranges follow the order of the character codes, the range `,-.` just happens to correspond to those same three characters: `,`, `-`, and `.`. The POSIX locale just uses the ASCII order, so that's exactly what the range corresponds to, but Unicode locales can be more complicated.
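The code points and the resulting range in the POSIX locale can be confirmed from the shell. This is a small sketch; the variable names are illustrative, and it relies on the standard `printf` rule that a numeric argument starting with a quote yields the code of the following character.

```shell
# Decimal codes of "," "-" and "." (a leading ' in a %d argument means
# "numeric value of the next character").
codes=$(printf '%d %d %d' "'," "'-" "'.")
echo "$codes"         # 44 45 46

# In the POSIX/C locale the range ,-. covers exactly those three characters.
ascii_range=$(printf '%s\n' '+,-./' | LC_ALL=C sed 's/[,-.]/X/g')
echo "$ascii_range"   # +XXX/
```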

I do not understand Unicode's collation (sorting) rules properly, but with everything I've tested except GNU sed v4.2.2, the character range corresponds to just `,`, `-`, and `.` in the `en_US.UTF-8` locale (and that includes testing with GNU sed v4.7). So I'm fairly sure that's what it should correspond to.

But not with GNU sed v4.2.2. There, `,-.` corresponds to a different set of punctuation characters:

$ chars=' !"#$%&'\''()*+,-./0123456789:;<=>?'
$ range=',-.'
$ echo "$chars"; echo "$chars" | LC_ALL=en_US.UTF-8 sed "s#[$range]#X#g"
 !"#$%&'()*+,-./0123456789:;<=>?
 X"#$%&'()*+X-XX0123456789XX<=>X
$ sed --version
sed (GNU sed) 4.2.2
Copyright (C) 2012 Free Software Foundation, Inc.
[...]

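...so with this sed version and locale, it's interpreting the range `,-.` as including `!`, `/`, `:`, `;`, and `?` (in addition to the `,` and `.` that are the endpoints of the range), but not `-` itself. That is why the substitution in the question stops at the first `-` on that system.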

I have no idea why it does this, but it appears to be a bug that got fixed somewhere between versions 4.2.2 and 4.7.

BTW, weirdness like this is why it's generally better to use `[[:alpha:]]` instead of `[a-zA-Z]` -- depending on the locale, those ranges might correspond to something quite different from what you expect (see this question for an example).
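For completeness, here is the command from the question with the bracket expression from the short answer swapped in. It is a minimal sketch: the variable names `name` and `version` are only for illustration, and it assumes the backslash is meant to be one of the trimmed characters.

```shell
name='xxx-MNP_ISS_DE-5.12.0.37-quality.zip'

# The leading "-" is literal because it comes first; [:alpha:] replaces
# the a-z and A-Z ranges; "." "\" and "_" are literal inside the brackets.
version=$(printf '%s\n' "$name" | sed 's#^[-[:alpha:].\_]*##')

echo "$version"   # 5.12.0.37-quality.zip
```

Because the expression no longer contains a punctuation range, it should not depend on how a given sed version orders `,` through `.` in a UTF-8 locale.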

Gordon Davisson