3

The command `echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -d '[:punct:]'` prints the string `Jiro Inagaki  Soul MediaBreeze`.

However, I want to find a regular expression that will remove all punctuation except the underscore and ampersand, i.e. I want "Jiro Inagaki & Soul Media_Breeze".

Following advice on character class subtraction from the sources listed at the bottom, I've tried replacing [:punct:] with the following:

  • [\p{P}\-[&_]]
  • [[:punct:]-[&_]]
  • (?![\&_])\p{P}
  • (?![\&_])[:punct:]
  • [[:punct:]-[&_]]
  • [[:punct:]&&[&_]]
  • [[:punct:]&&[^&_]]

... but I haven't gotten anything to work so far. Any help would be much appreciated!

Sources:

Wiktor Stribiżew
EET FEK
  • [`tr` does not use regular expressions](https://ss64.com/bash/tr.html). Its syntax superficially resembles the popular PCRE dialect, but it isn’t a regex evaluator. Consider using `sed` and `awk` instead of `tr`. – Dai Jun 01 '21 at 23:55
  • I don't think this is regex, it sounds like OP just wants to remove punctuation; `tr` should be fine for this – jared_mamrot Jun 01 '21 at 23:58
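
As a rough sketch of the `sed` route mentioned in the first comment (not something the commenters posted): POSIX bracket expressions have no class subtraction, so one workaround is to negate the set of characters you want to keep. The function name `keep_amp_underscore` is made up for illustration, and keeping all whitespace is an assumption about what is wanted.

```shell
# Hypothetical helper: delete every character that is not alphanumeric,
# whitespace, '&' or '_'. Inside a bracket expression, '&' is a literal.
keep_amp_underscore() {
  sed 's/[^[:alnum:][:space:]&_]//g'
}

printf '%s\n' "Jiro. Inagaki' & Soul, Media_Breeze." | keep_amp_underscore
```

With the sample input this should print `Jiro Inagaki & Soul Media_Breeze`.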

2 Answers

4

You can specify the punctuation marks you want removed, e.g.

>echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -d "[.,/\\-\=\+\{\[\]\}\!\@\#\$\%\^\*\'\\\(\)]"
Jiro Inagaki & Soul Media_Breeze

Or, alternatively,

>echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -dc '[:alnum:] &_'
Jiro Inagaki & Soul Media_Breeze
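
One caveat with the complement form: `-dc` deletes everything outside the listed set, and that includes the trailing newline. A minimal sketch that keeps newlines by adding `\n` to the set follows; the function name `strip_keep_newline` is hypothetical.

```shell
# Hypothetical helper: keep letters, digits, spaces, '&', '_' and newlines;
# delete every other character. tr itself interprets the '\n' escape.
strip_keep_newline() {
  tr -dc '[:alnum:] &_\n'
}

printf '%s\n' "Jiro. Inagaki' & Soul, Media_Breeze." | strip_keep_newline
```

This should print `Jiro Inagaki & Soul Media_Breeze` followed by a newline, so the shell prompt does not end up on the same line as the output.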
jared_mamrot
  • The second solution worked perfectly, thanks! – EET FEK Jun 02 '21 at 00:16
  • No problem. I thought this would be a duplicate question, but I couldn't find anything similar on SO. Glad it solved your issue. – jared_mamrot Jun 02 '21 at 00:24
  • I think the solution is a bit wrong. It should be `tr -dc '[:alnum:] &_'`. The `[]` in `tr` is only for character classes and is otherwise already a character set, so it doesn't have bracket expressions. The brackets will cause `tr` to not delete brackets. `echo '[hi &_123]' | tr -dc '[:alnum:][ &_]' # [hi &_123]`. `echo '[hi &_123]' | tr -dc '[:alnum:] &_' # hi &_123`. The un-complemented solution should also be adjusted. I don't think it needs so many backslash escapes. – dosentmatter Jul 05 '22 at 17:50
  • The un-complemented solution can have fewer or no escapes if you use single quotes. You also don't need to escape brackets if they aren't in the character class form `[:class:]`. – dosentmatter Jul 05 '22 at 17:59
  • You're absolutely right @dosentmatter, thank you for the correction. Regarding the backslash escapes, I wasn't able to find a working solution apart from the one posted; I would be very interested to see your alternative with single quotes – jared_mamrot Jul 05 '22 at 23:21
  • No problem. I'd implement it like this with single quotes: ``echo -n 'abcd_!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'" | tr -d '!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'"``. The only characters that need special handling are backslash (escaped as ``\\``) and single quote (concatenated at the end as a separate double-quoted string, `"'"`). I didn't follow your list of punctuation exactly. I based mine on [the list in GNU docs](https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html#index-punct-character-class), using a similar order. I included all punctuation except `_` and `&`. – dosentmatter Jul 06 '22 at 07:02
  • If possible, can you please post that as an answer to the question @dosentmatter? It's clearly 'more correct' than my answer and I really appreciate you taking the time to explain it further – jared_mamrot Jul 06 '22 at 10:13
1

Posting my comment as an answer as requested by @jared_mamrot.

You can manually type out the set of punctuation that you want to delete, leaving out the characters you want to keep (here `_` and `&`). I took my punctuation set from the GNU docs on `[:punct:]`:

‘[:punct:]’ Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

You can also look at the POSIX docs, which say that the character classes depend on locale:

punct    <exclamation-mark>;<quotation-mark>;<number-sign>;\
         <dollar-sign>;<percent-sign>;<ampersand>;<apostrophe>;\
         <left-parenthesis>;<right-parenthesis>;<asterisk>;\
         <plus-sign>;<comma>;<hyphen>;<period>;<slash>;\
         <colon>;<semicolon>;<less-than-sign>;<equals-sign>;\
         <greater-than-sign>;<question-mark>;<commercial-at>;\
         <left-square-bracket>;<backslash>;<right-square-bracket>;\
         <circumflex>;<underscore>;<grave-accent>;<left-curly-bracket>;\
         <vertical-line>;<right-curly-bracket>;<tilde>

$ echo 'abcd_!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'" | tr -d '!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'"
abcd_

The set of characters in the tr command should be straightforward except for backslash, \\, which has been escaped for tr, and single quote, "'", which is being concatenated as a string quoted in double quotes, since you can't escape a single quote within single quotes.
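
To make the quoting easier to follow, here is a sketch that stores the same set in a variable and applies it to the input from the question. The names `punct_to_delete` and `delete_listed_punct` are hypothetical, and the set is the ASCII / `C`-locale list quoted above minus `&` and `_`.

```shell
# ASCII punctuation to delete; '&' and '_' are deliberately left out.
# The backslash is doubled for tr, and the single quote is appended
# through a separate double-quoted string.
punct_to_delete='!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'"

# Hypothetical helper: delete only the characters listed above.
delete_listed_punct() {
  tr -d "$punct_to_delete"
}

printf '%s\n' "Jiro. Inagaki' & Soul, Media_Breeze." | delete_listed_punct
```

With the sample input this should print `Jiro Inagaki & Soul Media_Breeze`. Unlike the complement form, this leaves any character that is not in the list untouched, including newlines and non-ASCII text.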

I do prefer using @jared_mamrot's complement solution, if possible, though. It is much neater.

dosentmatter