2

I have a file s.csv

a,b+ -.,c 
aa,bb ().,c._c

I want to remove all special characters from 2nd column (file separated by comma)

cat s.csv | tr -dc '[:alnum:]\n\r' | tr '[:upper:]' '[:lower:]'

The above code also removes special characters from 3rd column as well.

awk -F, '{print $2}' s.csv | tr -dc '[:alnum:]\n\r' | tr '[:upper:]' '[:lower:]'

This code only print 2nd column.

Any idea how can I remove special char from 2nd column and price all

Required output should be

a,b,c 
aa,bb,c._c 
Amit Singh
  • 188
  • 9
  • Your code `tr -dc '[:alnum:]\n\r'` means you want to retain `\n`s when they appear in your 2nd column. That implies you can have `\n`s in your 2nd column. Given that, none of the line-oriented approaches so far can work. Please [edit] your question to tell is for sure if your fields can contain **newlines or commas** and show sample input/output that includes such cases. – Ed Morton May 26 '22 at 13:22

4 Answers4

3

Remove all (from second field)

  1. characters that are not upper case letters [^A-Z
  2. or lower case letters a-z
  3. or digits 0-9]
  4. from second field $2
  5. fields are with "," separated -F ','
  6. keep the separator in output OFS=FS

$ awk -F ',' 'BEGIN{OFS=FS}{gsub(/[^A-Za-z0-9]/,"",$2); print}' s.csv


# test
$ awk -F ',' 'BEGIN{OFS=FS}{gsub(/[^A-Za-z0-9]/,"",$2); print}' <<<'aa,bb ().,c._c'
aa,bb,c._c

As @Léa Gris mentioned below

Don't forget to set the locale to C or [^A-Za-z0-9] is gonna be interpreted unexpectedly in non-western European alphabets. Prepend awk invocation with LC_ALL=C

ufopilot
  • 3,269
  • 2
  • 10
  • 12
  • Don't forget to set the locale to `C` or `[^A-Za-z0-9]` is gonna be interpreted unexpectedly in non-western European alphabets. Prepend awk invocation with `LC_ALL=C`, like `LC_ALL=C awk -F ',' 'BEGIN{OFS=FS}{gsub(/[^A-Za-z0-9]/,"",$2); print}' s.csv`. Unfortunately, there is no way to change the locale settings within an awk script, so the locale settings need to be set before running awk. – Léa Gris May 26 '22 at 10:29
0

You can use the [:alpha:] character class using awk, here for second field and remove with gsub() function the characteres that aren't alpha:

awk 'BEGIN{OFS=FS=","} {gsub(/[^[:alpha:]]+/, "", $2)} 1' file
a,b,c
aa,bb,c._c
Carlos Pascual
  • 1,106
  • 1
  • 5
  • 8
0

Use this Perl one-liner:

perl -F',' -lane '$F[1] =~ s{[\W_]+}{}g; @F = map { lc } @F; print join ",", @F; ' in_file > out_file

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.

s{[\W_]+}{} : Replace 1 or more occurrences of \W (non-word character) or underscore with nothing.

The regex uses these modifiers:
/g : Match the pattern repeatedly.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
0

You don't have to alter locale just to do it - by using octals instead of letters, the regex engine respects them as ASCII instead of being overly clever - i even intentionally set it to Belgian French to illustrate :

CODE

echo 'a,b+ -.,c 
aa,bb ().,c._c' | {m,g}awk '

gsub("[^\\060-\\071\\101-\\132\\141-\\172]+","",$(!_+!_))^_' \
                                              OFS=',' FS=','

OUTPUT

a,b,c 
aa,bb,c._c

SHOWCASE LOCALE=C isn't needed

 LANG="fr_BE.UTF8" gawk -e '
 BEGIN { for(_=8*4;_<8^4;_++) { printf("%c",_) } } ' | 
 
 LANG="fr_BE.UTF8" gawk -p- -e '
 gsub("[^\\060-\\071\\101-\\132\\141-\\172]+","",$-_)^_' OFS=',' FS=','

——————————

0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

# profile gawk, cr'e'e Sun May 29 05:58:26 2022

# R`egle(s)

 1  (gsub("[^\\060-\\071\\101-\\132\\141-\\172]+", "", $-_)) ^ _ { # 1
 1      print
}
RARE Kpop Manifesto
  • 2,453
  • 3
  • 11