I need to clean up some text and am trying to remove numbers when they appear in parentheses. If the parentheses contain anything else, that should remain.

Examples:

Foo 12 (bar, 13) -> Foo 12 (bar)
Foo 12 (13, bar, 14) -> Foo 12 (bar) 
Foo (14, 13) -> Foo

I thought I would start by breaking up the string and removing numbers if they appear between parentheses, but it seems that I am missing something.

echo "Foo 12 (bar, 12)" | sed 's/\(.*\)\((\)\([^0-9,].*\)\([, ].*\)\([0-9].*\)\()\)/\1\2\3\6/g'

results in Foo 12 (bar,).

I guess my approach is too atomic. What can I do?

Barmar
PiEnthusiast

4 Answers

If you have no problem with Perl, you could try this.

$ perl -pe 's/\s*,?\s*\b\d+\b\s*,?\s*(?=[^()]*\))//g;s/\h*\(\)$//' file
Foo 12 (bar)
Foo 12 (bar)
Foo

OR

$ perl -pe 's/(?:(?<=\()\d+,\h*|,?\h*\d+\b)(?=[^()]*\))//g;s/\h*\(\)$//' file
Foo 12 (bar)
Foo 12 (bar)
Foo

DEMO

Avinash Raj
  • Thank you for showing your approach and linking to the demo. I don't know Perl and don't want to come across as ungrateful, but to complete the task, there should be no spaces after an opening or before a closing parenthesis, and parentheses that no longer have content should also be removed. I guess you would also need some kind of loop. – PiEnthusiast Jan 12 '15 at 12:14

Here's a general approach for problems like this, where you want to isolate a specific token and work on it, adapted for your problem:

#!/bin/sed -f

:loop                       # while the line has a matching token
/([^)]*[0-9]\+[^)]*)/ {
  s//\n&\n/                 # mark it -- \n is good as a marker because it is
                            # nowhere else in the line
  h                         # hold the line!
  s/.*\n\(.*\)\n.*/\1/      # isolate the token

  s/[0-9]\+,\s*//g          # work on the token. Here this removes all numbers
  s/,\s*[0-9]\+//g          # with or without commas in front or behind
  s/\s*[0-9]\+\s*//g
  s/\s*()//                 # and also empty parens if they exist after all that.

  G                         # get the line back
                            # and replace the marked token with the result of the
                            # transformation
  s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/

  b loop                    # then loop to get all such tokens.
}

To those who argue that this goes beyond the scope of what should reasonably be done with sed I say: True, but...well, true. But if all you see is nails, this is a way to make sed into a sledgehammer.

This can of course be written inline (although that does not help readability):

echo 'Foo 12 (bar, 12)' | sed ':loop;/([^)]*[0-9]\+[^)]*)/{;s//\n&\n/;h;s/.*\n\(.*\)\n.*/\1/;s/[0-9]\+,\s*//g;s/,\s*[0-9]\+//g;s/\s*[0-9]\+\s*//g;s/\s*()//;G;s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/;b loop}'

but my advice is to put it into a file and run echo 'Foo 12 (bar, 12)' | sed -f foo.sed. Or, with the shebang like above, chmod +x foo.sed and echo 'Foo 12 (bar, 12)' | ./foo.sed.

I have not benchmarked this, by the way. I imagine that it is not the most efficient way to process large amounts of data.

EDIT: In response to the comments: I'm not sure what OP wants in such cases, but for the sake of completeness, the basic pattern could be adapted for the other behavior like this:

#!/bin/sed -f

:loop
/(\s*[0-9]\+\s*)\|(\s*[0-9]\+\s*,[^)]*)\|([^)]*,\s*[0-9]\+\s*)\|([^)]*,\s*[0-9]\+\s*,[^)]*)/ {
  s//\n&\n/
  h
  s/.*\n\(.*\)\n.*/\1/

  s/,\s*[0-9]\+\s*,/,/g
  s/(\s*[0-9]\+\s*,\s*/(/
  s/\s*,\s*[0-9]\+\s*)/)/
  s/\s*(\s*[0-9]*\s*)//

  G
  s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/

  b loop
}

The regex at the top looks a lot scarier now. It should help to know that it consists of the four subpatterns

(\s*[0-9]\+\s*)
(\s*[0-9]\+\s*,[^)]*)
([^)]*,\s*[0-9]\+\s*)
([^)]*,\s*[0-9]\+\s*,[^)]*)

which are or-ed together with \|. This should cover all cases and not match things like foo12, 12bar, and foo12bar in parentheses (unless there's a standalone number in them as well).
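
For readers more at home outside sed, the same isolate-the-token idea can be sketched in Python. This is only an illustration of the behavior described above (standalone numbers are removed, items such as bar14 are kept), not a translation of the script; the names PAREN_GROUP and strip_standalone_numbers are made up for this example, and nested parentheses are assumed not to occur.

```python
import re

# One (...) group without nested parentheses, plus the whitespace in front of it.
PAREN_GROUP = re.compile(r"(\s*)\(([^()]*)\)")


def strip_standalone_numbers(line: str) -> str:
    """Remove comma-separated items that consist only of digits from each (...) group."""

    def rewrite(match: re.Match) -> str:
        leading_ws, content = match.group(1), match.group(2)
        items = [item.strip() for item in content.split(",")]
        kept = [item for item in items if not re.fullmatch(r"[0-9]+", item)]
        if len(kept) == len(items):
            return match.group(0)  # no standalone number: leave the group untouched
        if not kept:
            return ""  # only numbers: drop the group and the whitespace before it
        return f"{leading_ws}({', '.join(kept)})"

    return PAREN_GROUP.sub(rewrite, line)


if __name__ == "__main__":
    print(strip_standalone_numbers("Foo 12 (bar, 13)"))      # Foo 12 (bar)
    print(strip_standalone_numbers("Foo 12 (13, bar, 14)"))  # Foo 12 (bar)
    print(strip_standalone_numbers("Foo (14, 13)"))          # Foo
    print(strip_standalone_numbers("Foo (bar14, 15)"))       # Foo (bar14)
```

Note that a group that gets rewritten is re-joined with ", ", so irregular spacing around the commas inside that group is normalized as a side effect.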

Wintermute
  • Apart from solving the problem your approach is also very educational. I knew that several patterns would be needed but wouldn't have dreamed that it would be feasible in this way. Thank you very much. – PiEnthusiast Jan 12 '15 at 12:04
  • Very good explanation, but it fails on something like (bar14, 15) – NeronLeVelu Jan 12 '15 at 12:25
  • Thanks again. Actually "Foorbar (13)" is also possible. In this case the brackets should be removed. – PiEnthusiast Jan 13 '15 at 05:26
  • That works for me with both scripts. What is the output you get for it? – Wintermute Jan 13 '15 at 08:49
  • You are right @Wintermute, I broke it by trying to make it work with multiline text by adding `:a;N;$!ba;s/\n/ /g` (to replace line breaks, from http://stackoverflow.com/questions/1251999/sed-how-can-i-replace-a-newline-n) before the loop. – PiEnthusiast Jan 13 '15 at 14:59
  • OK, I give up, @Wintermute, what am I missing to make it work with text that is originally multiline which should become one line (without duplicate spaces which may have existed in the original)? – PiEnthusiast Jan 19 '15 at 20:52
  • Prepending `:a;N;$!ba;s/\n/ /g` works for me, as long as the input has at least two lines (that's a shortcoming of the pattern). Try `:a;$!{N;ba};s/\n/ /g` in case you have only a single line/have to consider that case. The difference between the two is that my version checks if the end of the input has been reached before trying to fetch another line. – Wintermute Jan 19 '15 at 21:15
  • I might have had a typo in my last attempt. Thank you, @Wintermute. – PiEnthusiast Jan 20 '15 at 07:12
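
The multiline requirement from the comments above (turn the text into a single line without duplicate spaces) can also be sketched outside sed. This is a minimal Python illustration, not part of any answer here; join_lines is a hypothetical helper name, and it assumes that any run of whitespace, including line breaks, may be collapsed into one space.

```python
def join_lines(text: str) -> str:
    """Join all lines into one and collapse every run of whitespace to a single space."""
    # str.split() without arguments splits on any whitespace run and drops
    # leading and trailing whitespace, so single-line input works as well.
    return " ".join(text.split())


if __name__ == "__main__":
    print(join_lines("Foo 12\n(bar,  13)\n"))  # Foo 12 (bar, 13)
```

Unlike the `:a;N;$!ba` pattern, this needs no special case for input that has only one line.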

Here is an awk version:

awk -F' *\\(|\\)' '{for (i=2;i<=NF;i+=2) {n=split($i,a," *, *");f="";for (j=1;j<=n;j++) f=f (a[j]!~/[[:digit:]]/?a[j]",":""); $i=f?"("f")":"";sub(/,)/,")",$i)}}1' file
Foo 12 (bar)
Foo 12 (bar)
Foo

cat file

Foo 12 (bar, 13, more)
Foo 12 (13, bar, 14) (434, tar ,56)
Foo (14, 13)

awk -F' *\\(|\\)' '{for (i=2;i<=NF;i+=2) {n=split($i,a," *, *");f="";for (j=1;j<=n;j++) f=f (a[j]!~/[[:digit:]]/?a[j]",":""); $i=f?"("f")":"";sub(/,)/,")",$i)}}1' file
Foo 12 (bar,more)
Foo 12 (bar)  (tar)
Foo

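A more readable version:

awk -F' *\\(|\\)' '
{
    # with this field separator, each even-numbered field holds the content of one (...) group
    for (i=2;i<=NF;i+=2) {
        n=split($i,a," *, *")
        f=""
        # keep only the items that contain no digit, each followed by a comma
        for (j=1;j<=n;j++)
            f=f (a[j]!~/[[:digit:]]/?a[j]",":"")
        # put the parentheses back, or drop the group if nothing is left
        $i=f?"("f")":""
        # remove the trailing comma before the closing parenthesis
        sub(/,)/,")",$i)
    }
}
1' file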
Jotne
sed ':retry

# remove "( number )"
s/( *[0-9]* *)//

# remove first ", number" (not at first place)
s/^\(\([^(]*([^(]*)\)*[^(]*([^)]*\), *[0-9]\{1,\} *\([,)]\)/\1\3/
    t retry

# remove " number" (first place)
s/^\(\([^(]*([^(]*)\)*[^(]*(\) *[0-9]\{1,\}\(,\{0,1\}\)\()\{0,1\}\)]*/\1\4/

# case needed where only "( number)" or "()" are the result at this moment
t retry
' YourFile
  • (POSIX version, so use --posix with GNU sed)
NeronLeVelu