Numbers, parentheses and sed

Question

I need to clean up some text and am trying to remove numbers when they appear in parentheses. If the parentheses contain anything else, that should remain.

Examples:

Foo 12 (bar, 13) -> Foo 12 (bar)
Foo 12 (13, bar, 14) -> Foo 12 (bar) 
Foo (14, 13) -> Foo

I thought I would start by breaking up the string and removing numbers if they appear between parentheses but it seems that I am missing something.

echo "Foo 12 (bar, 12)" | sed 's/\(.*\)\((\)\([^0-9,].*\)\([, ].*\)\([0-9].*\)\()\)/\1\2\3\6/g'

results in Foo 12 (bar,).

I guess my approach is too atomic. What can I do?

Did you want an answer in perl? — Avinash Raj, Jan 12 '15 at 10:33

Avinash Raj · Answer 1 · 2015-01-12T10:53:40.980

If you have no problem with Perl, you could try this.

$ perl -pe 's/\s*,?\s*\b\d+\b\s*,?\s*(?=[^()]*\))//g;s/\h*\(\)$//' file
Foo 12 (bar)
Foo 12 (bar)
Foo

OR

$ perl -pe 's/(?:(?<=\()\d+,\h*|,?\h*\d+\b)(?=[^()]*\))//g;s/\h*\(\)$//' file
Foo 12 (bar)
Foo 12 (bar)
Foo

DEMO

edited Jan 12 '15 at 10:53

answered Jan 12 '15 at 10:40


Thank you for showing your approach and linking to the demo. I don't know the Perl way and don't want to come across as ungrateful, but to complete the task, spaces after opening and before closing parentheses should not occur, and parentheses that no longer have content should also be removed. I guess you would also need some kind of loop. — PiEnthusiast, Jan 12 '15 at 12:14

Wintermute · Accepted Answer · 2015-01-12T13:05:24.100

Here's a general approach for problems like this, where you want to isolate a specific token and work on it, adapted for your problem:

#!/bin/sed -f

:loop                       # while the line has a matching token
/([^)]*[0-9]\+[^)]*)/ {
  s//\n&\n/                 # mark it -- \n is good as a marker because it is
                            # nowhere else in the line
  h                         # hold the line!
  s/.*\n\(.*\)\n.*/\1/      # isolate the token

  s/[0-9]\+,\s*//g          # work on the token. Here this removes all numbers
  s/,\s*[0-9]\+//g          # with or without commas in front or behind
  s/\s*[0-9]\+\s*//g
  s/\s*()//                 # and also empty parens if they exist after all that.

  G                         # get the line back
                            # and replace the marked token with the result of the
                            # transformation
  s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/

  b loop                    # then loop to get all such tokens.
}

To those who argue that this goes beyond the scope of what should reasonably be done with sed I say: True, but...well, true. But if all you see is nails, this is a way to make sed into a sledgehammer.

This can of course be written inline (although that does not help readability):

echo 'Foo 12 (bar, 12)' | sed ':loop;/([^)]*[0-9]\+[^)]*)/{;s//\n&\n/;h;s/.*\n\(.*\)\n.*/\1/;s/[0-9]\+,\s*//g;s/,\s*[0-9]\+//g;s/\s*[0-9]\+\s*//g;s/\s*()//;G;s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/;b loop}'

but my advice is to put it into a file and run echo 'Foo 12 (bar, 12)' | sed -f foo.sed. Or, with the shebang like above, chmod +x foo.sed and echo 'Foo 12 (bar, 12)' | ./foo.sed.
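
A minimal sketch of that file-based setup, run against the three examples from the question. It assumes GNU sed (for \s, \+ and \n in the replacement). The temp directory and the function name strip_paren_numbers are made up for illustration. The address is written with [^)]* before the closing parenthesis so that the number does not have to be the last thing in the group.

```shell
#!/bin/sh
# Sketch (GNU sed assumed): put the script in a file and run it with sed -f.
dir=$(mktemp -d)                  # hypothetical scratch directory
cat > "$dir/foo.sed" <<'EOF'
:loop
/([^)]*[0-9]\+[^)]*)/ {
  s//\n&\n/
  h
  s/.*\n\(.*\)\n.*/\1/
  s/[0-9]\+,\s*//g
  s/,\s*[0-9]\+//g
  s/\s*[0-9]\+\s*//g
  s/\s*()//
  G
  s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/
  b loop
}
EOF

# Hypothetical helper: filter stdin through the script.
strip_paren_numbers() { sed -f "$dir/foo.sed"; }

printf '%s\n' 'Foo 12 (bar, 13)' 'Foo 12 (13, bar, 14)' 'Foo (14, 13)' |
  strip_paren_numbers
# With GNU sed this should print:
#   Foo 12 (bar)
#   Foo 12 (bar)
#   Foo            (plus one trailing space, see below)
```

The last line keeps the space that stood before the parentheses, because s/\s*()// only sees the isolated token and not the text in front of it. If that matters, a final s/ *$// after the loop would trim it.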

I have not benchmarked this, by the way. I imagine that it is not the most efficient way to process large amounts of data.

EDIT: In response to the comments: I'm not sure what OP wants in cases like (bar14, 15), but for the sake of completeness, the basic pattern could be adapted to remove only standalone numbers, like this:

#!/bin/sed -f

:loop
/(\s*[0-9]\+\s*)\|(\s*[0-9]\+\s*,[^)]*)\|([^)]*,\s*[0-9]\+\s*)\|([^)]*,\s*[0-9]\+\s*,[^)]*)/ {
  s//\n&\n/
  h
  s/.*\n\(.*\)\n.*/\1/

  s/,\s*[0-9]\+\s*,/,/g
  s/(\s*[0-9]\+\s*,\s*/(/
  s/\s*,\s*[0-9]\+\s*)/)/
  s/\s*(\s*[0-9]*\s*)//

  G
  s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/

  b loop
}

The regex at the top looks a lot scarier now. It should help to know that it consists of the four subpatterns

(\s*[0-9]\+\s*)
(\s*[0-9]\+\s*,[^)]*)
([^)]*,\s*[0-9]\+\s*)
([^)]*,\s*[0-9]\+\s*,[^)]*)

which are or-ed together with \|. This should cover all cases and not match things like foo12, 12bar, and foo12bar in parentheses (unless there's a standalone number in them as well).
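
As a quick check of that claim, here is a sketch that runs the second script on a few lines. It assumes GNU sed. The temp file, the function name strip_standalone_numbers and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Sketch (GNU sed assumed): the variant that only removes list items
# consisting of nothing but digits.
script=$(mktemp)                  # hypothetical scratch file
cat > "$script" <<'EOF'
:loop
/(\s*[0-9]\+\s*)\|(\s*[0-9]\+\s*,[^)]*)\|([^)]*,\s*[0-9]\+\s*)\|([^)]*,\s*[0-9]\+\s*,[^)]*)/ {
  s//\n&\n/
  h
  s/.*\n\(.*\)\n.*/\1/
  s/,\s*[0-9]\+\s*,/,/g
  s/(\s*[0-9]\+\s*,\s*/(/
  s/\s*,\s*[0-9]\+\s*)/)/
  s/\s*(\s*[0-9]*\s*)//
  G
  s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/
  b loop
}
EOF

# Hypothetical helper: filter stdin through the script.
strip_standalone_numbers() { sed -f "$script"; }

printf '%s\n' 'Foo (bar14, 15)' 'Foo (12bar, foo12bar)' 'Foo 12 (13, bar, 14)' |
  strip_standalone_numbers
# With GNU sed this should print:
#   Foo (bar14)
#   Foo (12bar, foo12bar)
#   Foo 12 (bar)
```

Note that bar14 survives here, whereas the first script would also strip the digits that are attached to a word.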

Apart from solving the problem your approach is also very educational. I knew that several patterns would be needed but wouldn't have dreamed that it would be feasible in this way. Thank you very much. — PiEnthusiast, Jan 12 '15 at 12:04
Very good explanation, but it fails on something like (bar14, 15) — NeronLeVelu, Jan 12 '15 at 12:25
Thanks again. Actually "Foorbar (13)" is also possible. In this case the brackets should be removed. — PiEnthusiast, Jan 13 '15 at 05:26
That works for me with both scripts. What is the output you get for it? — Wintermute, Jan 13 '15 at 08:49
You are right @Wintermute, I broke it by trying to make it work with multiline text by adding `:a;N;$!ba;s/\n/ /g` (to replace line breaks, from http://stackoverflow.com/questions/1251999/sed-how-can-i-replace-a-newline-n) before the loop. — PiEnthusiast, Jan 13 '15 at 14:59
OK, I give up, @Wintermute, what am I missing to make it work with text that is originally multiline which should become one line (without duplicate spaces which may have existed in the original)? — PiEnthusiast, Jan 19 '15 at 20:52
Prepending `:a;N;$!ba;s/\n/ /g` works for me, as long as the input has at least two lines (that's a shortcoming of the pattern). Try `:a;$!{N;ba};s/\n/ /g` in case you have only a single line/have to consider that case. The difference between the two is that my version checks if the end of the input has been reached before trying to fetch another line. — Wintermute, Jan 19 '15 at 21:15
I might have had a typo in my last attempt. Thank you, @Wintermute. — PiEnthusiast, Jan 20 '15 at 07:12
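
For reference, the join from the last few comments written out one command per line, as a sketch. The function name join_lines is made up, and the extra s/  */ /g (to squeeze runs of spaces, which the comments ask for but do not show) is my addition, not something from the thread.

```shell
#!/bin/sh
# Sketch: join all input lines into one, then squeeze repeated spaces.
# The "$!" guard means N is only attempted while more input exists,
# so a single-line input passes through unharmed.
join_lines() {
  sed ':a
$!{
  N
  ba
}
s/\n/ /g
s/  */ /g'
}

printf 'Foo 12\n(bar,  13)\n' | join_lines    # prints: Foo 12 (bar, 13)
printf 'only one line\n' | join_lines         # prints: only one line
```

The joined line can then be piped into the loop script, or the same commands can be placed at the top of the script file in front of :loop.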

Jotne · Answer 3 · 2015-01-12T11:32:37.607

Here is an awk version:

awk -F' *\\(|\\)' '{for (i=2;i<=NF;i+=2) {n=split($i,a," *, *");f="";for (j=1;j<=n;j++) f=f (a[j]!~/[[:digit:]]/?a[j]",":""); $i=f?"("f")":"";sub(/,)/,")",$i)}}1' file
Foo 12 (bar)
Foo 12 (bar)
Foo

cat file

Foo 12 (bar, 13, more)
Foo 12 (13, bar, 14) (434, tar ,56)
Foo (14, 13)

awk -F' *\\(|\\)' '{for (i=2;i<=NF;i+=2) {n=split($i,a," *, *");f="";for (j=1;j<=n;j++) f=f (a[j]!~/[[:digit:]]/?a[j]",":""); $i=f?"("f")":"";sub(/,)/,")",$i)}}1' file
Foo 12 (bar,more)
Foo 12 (bar)  (tar)
Foo

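A more readable version:

awk -F' *\\(|\\)' '
    {
    # even-numbered fields hold the contents of each pair of parentheses
    for (i=2;i<=NF;i+=2) {
        n=split($i,a," *, *")
        f=""
        # keep only the list items that contain no digit
        for (j=1;j<=n;j++)
            f=f (a[j]!~/[[:digit:]]/?a[j]",":"")
        # re-wrap what is left, or drop the parentheses if nothing is left
        $i=f?"("f")":""
        sub(/,)/,")",$i)
        }
    }
1' file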
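
The trick in this version is the field separator: splitting on " *(" or ")" puts the contents of each pair of parentheses into the even-numbered fields. A small sketch to make that visible; the function name show_fields is made up for illustration.

```shell
#!/bin/sh
# Sketch: print every field awk produces with the separator used above.
show_fields() {
  awk -F' *\\(|\\)' '{ for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }'
}

echo 'Foo 12 (13, bar, 14) (434, tar ,56)' | show_fields
# prints:
#   1:[Foo 12]
#   2:[13, bar, 14]
#   3:[]
#   4:[434, tar ,56]
#   5:[]
```

The empty odd-numbered fields are what produce the doubled space in the "Foo 12 (bar)  (tar)" output above: awk rebuilds the line by joining all fields with single spaces, including the empty ones.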

Thank you for sharing your approach and respect for your _awkward_ thinking. (SCNR) — PiEnthusiast, Jan 12 '15 at 12:08

NeronLeVelu · Answer 4 · 2015-01-12T13:22:27.523

sed ':retry

# remove "( number )"
s/( *[0-9]* *)//

# remove first ", number" (not at first place)
s/^\(\([^(]*([^(]*)\)*[^(]*([^)]*\), *[0-9]\{1,\} *\([,)]\)/\1\3/
    t retry

# remove " number" (first place)
s/^\(\([^(]*([^(]*)\)*[^(]*(\) *[0-9]\{1,\}\(,\{0,1\}\)\()\{0,1\}\)]*/\1\4/

# case needed where only "( number)" or "()" are the result at this moment
t retry
' YourFile

(POSIX version, so use --posix with GNU sed)
