Replace string with substring in lowercase using sed / awk / tr / perl?

Question

I have a plaintext file containing multiple instances of the pattern $$DATABASE_*$$, where the asterisk can be any string of characters. I'd like to replace each instance with whatever is in the asterisk portion, converted to lowercase.

Here is a test file:

$$DATABASE_GIBSON$$

test me $$DATABASE_GIBSON$$ test me

$$DATABASE_GIBSON$$ test $$DATABASE_GIBSON$$ test

$$DATABASE_GIBSON$$ $$DATABASE_GIBSON$$$$DATABASE_GIBSON$$

Here is the desired output:

gibson

test me gibson test me

gibson test gibson test

gibson gibsongibson

How do I do this with sed/awk/tr/perl?

http://stackoverflow.com/q/4569825/318716 – Joseph Quinsey Oct 25 '12 at 17:17
http://stackoverflow.com/q/689495/318716 – Joseph Quinsey Oct 25 '12 at 17:19

score 3 · Accepted Answer · answered Oct 25 '12 at 19:25

Here's the perl version I ended up using.

perl -p -i.bak -e 's/\$\$DATABASE_(.*?)\$\$/lc($1)/eg' inputFile

Nice solution indeed. Note however, that it won't work if `*` contains newlines. – mschilli Jul 07 '15 at 17:40

Gilles Quénot · Answer 2 · 2012-10-25T20:03:38.690

1

This one works with complicated examples.

perl -ple 's/\$\$DATABASE_(.*?)\$\$/lc($1)/eg' filename.txt

And for simpler examples (one tag per line, GNU sed only):

echo '$$DATABASE_GIBSON$$' | sed 's@$$DATABASE_\(.*\)\$\$@\L\1@'

In GNU sed, \L converts the rest of the replacement to lower case (\E stops the conversion if needed).
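
For lines that carry several tags, a variation on the same idea might look like the sketch below. It assumes GNU sed (the \L escape is a GNU extension) and that the name never contains a "$" of its own; `lower_tags_gnu_sed` is a name invented here.

```shell
# Assumption: GNU sed. [^$]* keeps each match inside one tag,
# and the g flag repeats the substitution along the line.
lower_tags_gnu_sed() {
    sed 's/\$\$DATABASE_\([^$]*\)\$\$/\L\1/g' "$@"
}

printf '%s\n' '$$DATABASE_GIBSON$$ $$DATABASE_GIBSON$$$$DATABASE_GIBSON$$' | lower_tags_gnu_sed
```

On the last line of the question's test file this prints "gibson gibsongibson".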

edited Oct 25 '12 at 20:03

answered Oct 25 '12 at 17:16

Gilles Quénot

Not quite. I'm using this test file: http://pastebin.com/Q6RvvdcD And the output looks like this: http://pastebin.com/CBe0Mehb – Oct 25 '12 at 17:25
Added perl portable solution. – Gilles Quénot Oct 25 '12 at 17:28
With the same input file as above, using the perl, I get this: http://pastebin.com/y2uFq1Xk That one is seriously messing with formatting and deleting things. – Oct 25 '12 at 17:32
@anubhava - it doesn't work on OSX because \L and \E are GNU sed-isms. This answer works in most Linux environments, but is not portable. – ghoti Oct 25 '12 at 17:43
FYI I'm in a FreeBSD environment. – Oct 25 '12 at 17:45
@BlueJ774 - as am I. OSX and FreeBSD use the same sed. – ghoti Oct 25 '12 at 17:50
See my new perl command in my POST : `perl -ple 's/\$\$DATABASE_(.*?)\$\$/lc($1)/eg' filename.txt` – Gilles Quénot Oct 25 '12 at 20:04
`\L` means `lc()` in Perl as well you know. – LeoNerd Oct 26 '12 at 11:53
@sputnick: What is the difference between your updated perl solution and the solution posted and accepted by the OP half an hour before your edit? – mschilli Aug 29 '13 at 05:12

score 1 · Answer 3 · answered Oct 25 '12 at 19:47

Unfortunately there's no easy, foolproof way with awk, but here's one approach:

$ cat tst.awk
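{
   # Turn every "$$" into a newline, which cannot otherwise occur within a record.
   gsub(/[$][$]/,"\n")

   head = ""
   tail = $0

   # Consume one \nDATABASE_...\n tag at a time from the front of tail.
   while ( match(tail, "\nDATABASE_[^\n]+\n") ) {
      head = head substr(tail,1,RSTART-1)
      trgt = substr(tail,RSTART,RLENGTH)
      tail = substr(tail,RSTART+RLENGTH)

      # Strip the delimiters and the DATABASE_ prefix, keeping only the name.
      gsub(/\n(DATABASE_)?/,"",trgt)

      head = head tolower(trgt)

   }

   $0 = head tail

   # Restore any "$$" that was not part of a tag.
   gsub("\n","$$")

   print
}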

$ cat file
The quick brown $$DATABASE_FOX$$ jumped over the lazy $$DATABASE_DOG$$s back.
The grey $$DATABASE_SQUIRREL$$ ate $$DATABASE_NUT$$s under a $$DATABASE_TREE$$.
Put a dollar $$DATABASE_DOL$LAR$$ in the $$ string.

$ awk -f tst.awk file
The quick brown fox jumped over the lazy dogs back.
The grey squirrel ate nuts under a tree.
Put a dollar dol$lar in the $$ string.

Note the trick of converting $$ to a newline character so that we can negate that character in the match() regexp. Without it (i.e. if we used ".+" instead of "[^\n]+"), greedy matching would bite: if the pattern appeared twice on one input line, the matched string would extend from the start of the first occurrence to the end of the second.
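
The greedy-matching problem can be seen in isolation with a small experiment. This is only an illustration with a made-up input line; the bounded variant uses "[^$]+", which is simpler than the newline trick above but would reject a name containing a single "$".

```shell
line='$$DATABASE_FOX$$ and $$DATABASE_DOG$$'

# Greedy ".+" runs from the first opening "$$" to the last closing "$$".
greedy=$(printf '%s\n' "$line" |
    awk 'match($0, /[$][$]DATABASE_.+[$][$]/) { print substr($0, RSTART, RLENGTH) }')

# Excluding the delimiter character stops the match at the first closing "$$".
bounded=$(printf '%s\n' "$line" |
    awk 'match($0, /[$][$]DATABASE_[^$]+[$][$]/) { print substr($0, RSTART, RLENGTH) }')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The greedy match covers the whole line, while the bounded one stops after $$DATABASE_FOX$$.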

Ed Morton

Nice code. Would you mind commenting on [my solution](http://stackoverflow.com/a/18484993/2451238)? I think I solved the problem with very little (`g`)`awk`. It should even work with newlines within the `*` string. But maybe I got something wrong. In this case I would like to learn from this. :) – mschilli Jul 07 '15 at 17:22
It doesn't produce the expected output from the sample input in the question. – Ed Morton Jul 07 '15 at 19:10
For me it does. Did you use GNU `awk` `gawk`? IIRC, POSIX `awk` does not support regular expression (RE) record separators (RS). If you tested it using `gawk`, what is the output you got and which version did you use? – mschilli Jul 07 '15 at 20:28
Yes I use gawk 4.1.1. The final line of output is `gibson gibson` with no terminating newline instead of `gibson gibsongibson` with a terminating newline. – Ed Morton Jul 07 '15 at 22:23
Thx for your input. The terminating newline was missing since `ORS` was empty in the case of the last record. Thus the assignment evaluated to false, not triggering the print. I fixed that by wrapping the assignment into an unconditioned *action* and adding an unconditioned `print` using the `1` idiom. However, the `$$DATABASE_GIBSON$$$$DATABASE_GIBSON$$` part is transformed to `gibsongibson` as expected for me. Can you double-check that this is still not the case for you with my latest version? I'm on `gawk 4.0.2` so maybe sth changed since then. I'll try a recent `gawk` later today. Thx. – mschilli Jul 08 '15 at 05:54
My bad, I'd missed the last `$` off the input when I copy/pasted it. It's working now, looks good. – Ed Morton Jul 08 '15 at 12:31

ghoti · Answer 4 · 2012-10-25T19:10:58.013

Using awk alone:

> echo '$$DATABASE_AWESOME$$' | awk '{sub(/.*_/,"");sub(/\$\$$/,"");print tolower($0);}'
awesome

Note that I'm in FreeBSD, so this is not GNU awk.

But this can be done using bash alone:

[ghoti@pc ~]$ foo='$$DATABASE_AWESOME$$'
[ghoti@pc ~]$ foo=${foo##*_}
[ghoti@pc ~]$ foo=${foo%\$\$}
[ghoti@pc ~]$ foo=${foo,,}
[ghoti@pc ~]$ echo $foo
awesome

Of the above substitutions, all except the last one (${foo,,}, which needs bash 4 or later) will work in any POSIX shell. If you don't have bash, you can use tr for this step instead:

$ echo $foo
AWESOME
$ foo=$(echo "$foo" | tr '[:upper:]' '[:lower:]')
$ echo $foo
awesome
$
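
Put together, the portable steps could be wrapped in a small function. This is a sketch, not part of the original answer: `tag_to_lower` is a made-up name, it expects exactly one tag as its argument, and it strips the literal $$DATABASE_ prefix instead of using ##*_ so that names containing underscores survive.

```shell
# Hypothetical helper: expects exactly one tag such as $$DATABASE_AWESOME$$.
tag_to_lower() {
    name=${1#\$\$DATABASE_}   # strip the leading $$DATABASE_
    name=${name%\$\$}         # strip the trailing $$
    printf '%s\n' "$name" | tr '[:upper:]' '[:lower:]'
}

tag_to_lower '$$DATABASE_AWESOME$$'
```

This prints "awesome" and should behave the same under sh and bash, since it relies only on POSIX parameter expansion and tr.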

UPDATE:

Per comments, it seems that what the OP really wants is to strip the substring out of any text in which it is included -- that is, our solutions need to account for the possibility of leading or trailing spaces, before or after the string he provided in his question.

> echo 'foo $$DATABASE_KITTENS$$ bar' | sed -nE '/\$\$[^$]+\$\$/{;s/.*\$\$DATABASE_//;s/\$\$.*//;p;}' | tr '[:upper:]' '[:lower:]'
kittens

And if you happen to have pcregrep on your path (from the devel/pcre FreeBSD port), you can use that instead, with a lookbehind and a lookahead:

> echo 'foo $$DATABASE_KITTENS$$ bar' | pcregrep -o '(?<=\$\$DATABASE_)[A-Z]+(?=\$\$)' | tr '[:upper:]' '[:lower:]'
kittens

(For Linux users reading this: this is equivalent to using grep -oP.)

And in pure bash:

$ shopt -s extglob
$ foo='foo $$DATABASE_KITTENS$$ bar'
$ foo=${foo##*(?)\$\$DATABASE_}
$ foo=${foo%%\$\$*(?)}
$ foo=${foo,,}
$ echo $foo
kittens

Note that NONE of these three updated solutions will handle situations where multiple tagged database names exist in the same line of input. That's not stated as a requirement in the question either, but I'm just sayin'....
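
For what it's worth, a loop over parameter expansions can cover the multi-tag case while leaving the surrounding text alone. This is only a sketch: `lower_tags` and its variable names are made up for illustration, it uses POSIX shell expansions plus tr rather than bash-only features, and it starts one tr process per tag, so it would be slow on large inputs.

```shell
# Hypothetical helper: rewrites every $$DATABASE_name$$ in one line of text.
lower_tags() {
    rest=$1
    out=
    while :; do
        # Stop as soon as no complete tag is left in the remaining text.
        case $rest in
            *'$$DATABASE_'*'$$'*) ;;
            *) break ;;
        esac
        before=${rest%%'$$DATABASE_'*}   # text in front of the first tag
        after=${rest#*'$$DATABASE_'}     # everything after its prefix
        name=${after%%'$$'*}             # the name, up to the closing $$
        rest=${after#*'$$'}              # whatever follows the tag
        out=$out$before$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
    done
    printf '%s\n' "$out$rest"
}

lower_tags '$$DATABASE_GIBSON$$ $$DATABASE_GIBSON$$$$DATABASE_GIBSON$$'
```

The call above prints "gibson gibsongibson". To process a file, the function could be called from a `while IFS= read -r line` loop.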

Close, but not quite with awk. Input: http://pastebin.com/Q6RvvdcD Output: http://pastebin.com/66HLeqgt — , Oct 25 '12 at 17:57
Those samples are not included in your question. I answered the question posted. — ghoti, Oct 25 '12 at 18:37
@BlueJ774 - updated my answer with your new requirements. You might want to be more explicit [in your question](http://stackoverflow.com/posts/13073727/edit) to avoid confusion. — ghoti, Oct 25 '12 at 18:55
Nice answer, but even your updated version does not do what (current version of) the question asks for: It will remove all input *not* to be transformed to lowercase instead of outputting it as-is. — mschilli, Jul 07 '15 at 17:36

score 0 · Answer 5 · answered Oct 25 '12 at 19:59

You can do this in a pretty foolproof way with the supercool command cut :)

echo '$$DATABASE_AWESOME$$' | cut -d'$' -f3 | cut -d_ -f2 | tr 'A-Z' 'a-z'

miono

score 0 · Answer 6 · answered Oct 26 '12 at 08:29

This might work for you (GNU sed):

sed 's/$\$/\n/g;s/\nDATABASE_\([^\n]*\)\n/\L\1/g;s/\n/$$/g' file

potong

mschilli · Answer 7 · 2015-07-08T05:46:42.747

0

Here is the shortest (GNU) awk solution I could come up with that does everything requested by the OP:

awk -vRS='[$][$]DATABASE_([^$]+[$])+[$]' '{ORS=tolower(substr(RT,12,length(RT)-13))}1'

Even if the string indicated by the asterisk (*) contains one or more single dollar signs ($) and/or line breaks, this solution should still work.

edited Jul 08 '15 at 05:46

answered Aug 28 '13 at 10:10

mschilli

score 0 · Answer 8 · answered May 01 '16 at 00:08

awk '{gsub(/\$\$DATABASE_GIBSON\$\$/,"gibson")}1' file
gibson

test me gibson test me

gibson test gibson test

gibson gibsongibson

Claes Wikner

score -1 · Answer 9 · answered Oct 25 '12 at 17:22

echo '$$DATABASE_WOOLY$$' | awk '{print tolower($0)}'

awk takes whatever input it is given, in this case the line piped in from echo, and prints it after converting it with the tolower function.

For your bash script you can do something like this and then use the variable DBLOWER:

DBLOWER=$(echo '$$DATABASE_WOOLY$$' | awk '{print tolower($0)}')
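
One thing to watch when typing such tags on a command line: the shell expands an unquoted $$ to its own process ID before echo ever sees it, so the tag has to be single-quoted. A small sketch of the difference (the variable names are arbitrary):

```shell
# Unquoted: the shell replaces each $$ with its own process ID before echo runs.
unquoted=$(echo $$DATABASE_WOOLY$$)

# Single-quoted: the tag reaches awk unchanged.
quoted=$(echo '$$DATABASE_WOOLY$$' | awk '{print tolower($0)}')

printf '%s\n%s\n' "$unquoted" "$quoted"
```

The first line of output has a number on either side of DATABASE_WOOLY; the second is the lowercased tag, still wrapped in its dollar signs.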

Adam

This is not replacing `$$DATABASE_*$$` by `*` as requested by the OP. Also it will convert *all* the input to lower case. – mschilli Jul 07 '15 at 17:26
