
See the following cleanCustomer.sh file:

#!/bin/bash
customer=Reportçós
cleanedCustomer=${customer//[^a-zA-Z0-9 \-_.]/}
echo $cleanedCustomer

When I run it on Windows 11 in Git Bash it prints Reports.
When I run it on CentOS in terminal it prints Reportçós.

Does anybody know why a-z is interpreted as all alphabetic characters on CentOS but not on Windows?
How do I ensure that only English characters are matched on CentOS?

Wiktor Stribiżew
  • Looks like collations. Try `LC_ALL=C cleanedCustomer=${customer//[^a-zA-Z0-9 _.-]/}`. I guess `\-` was used to escape `-`. Also, it is not a regex, it is a glob pattern. – Wiktor Stribiżew May 13 '22 at 12:49
  • Thanks. It took me some time to find the documentation showing that it is a glob and not a regex: [see ${parameter/pattern/string}](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html#Shell-Parameter-Expansion) – Krakowski mudrac May 13 '22 at 13:50

1 Answer

From the bash manual:

A pair of characters separated by a hyphen denotes a range expression; any character that falls between those two characters, inclusive, using the current locale’s collating sequence and character set, is matched. If the first character following the ‘[’ is a ‘!’ or a ‘^’ then any character not enclosed is matched. A ‘-’ may be matched by including it as the first or last character in the set.

Your Git Bash locale uses collation rules under which ranges like a-z don't match accented characters; your CentOS locale's rules do. This can be addressed by using a consistent locale like C for collation. Also, your - is in the wrong spot: it needs to be first or last in the set, and the backslash needs to be escaped with another backslash to match a literal one.

#!/bin/bash
LC_COLLATE=C
customer=Reportçós
cleanedCustomer=${customer//[^a-zA-Z0-9 \\_.-]/}
printf "%s\n" "$cleanedCustomer"
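
As a rough sketch of the same idea, the cleanup can be wrapped in a function that pins the locale only for its own body. The function name `clean_customer` is hypothetical and not from the question. It sets `LC_ALL` rather than `LC_COLLATE` on the assumption that `LC_ALL` might already be set in the environment, in which case it would take precedence over `LC_COLLATE`.

```shell
#!/bin/bash
# clean_customer is a hypothetical helper name used only for illustration.
# The body runs in a subshell (parentheses instead of braces), so the
# locale change does not leak into the calling script.
clean_customer() (
    # LC_ALL takes precedence over LC_COLLATE when both are set, so pinning
    # LC_ALL makes a-z, A-Z and 0-9 cover ASCII only.
    LC_ALL=C
    # Keep alphanumerics, space, backslash, underscore, dot and hyphen.
    # The hyphen is last in the set and the backslash is doubled.
    cleaned=${1//[^a-zA-Z0-9 \\_.-]/}
    printf '%s\n' "$cleaned"
)

clean_customer 'Reportçós'   # prints: Reports
```

If the rest of the script does not depend on the user's locale, setting the variable once at the top, as in the script above, is simpler.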
Shawn
  • thanks. Although I'll make a correction to your correction since you dropped "\" (it is a literal, not an escape character). You are correct that the `-` needs to go to the end (or start) to be matched. `cleanedCustomer=${customer//[^a-zA-Z0-9 \_.-]/}` The following characters (excluding alphanumerics) are meant to be "cleaned": ` ` (space), "\", `_`, `.`, `-`. But this is kinda out of scope of the question. The `LC_COLLATE` does the job/is the answer. – Krakowski mudrac May 13 '22 at 13:06
  • @Krakowskimudrac You were trying to match a literal backslash? That *will* have to be escaped in the pattern - two backslashes, not one. – Shawn May 13 '22 at 13:11
  • Not correct, @Shawn - I just tested. Only 1 backslash is fine for a literal match. – Krakowski mudrac May 13 '22 at 13:13
  • Wasn't working for me but I might have messed up somewhere. (Tried again, and I need the doubled backslash to avoid stripping them from a string. bash 4.4.20). – Shawn May 13 '22 at 13:23
  • You are right! I had messed something up! So this is the pattern I need: `${customer//[^a-zA-Z0-9 \\_.-]/}` – Krakowski mudrac May 13 '22 at 13:29