To replace whole words with sed, one does:

$ echo "bar embarassment" | sed "s/\bbar\b/no bar/g"
no bar embarassment

This is taken from another Stack Overflow question. As a follow-up: how do I change the definition of a word?

From linuxtopia:

GNU sed, ssed, sed16, sed15 and sedmod use certain symbols to define the boundary between a "word character" and a nonword character. A word character fits the regex "[A-Za-z0-9_]".

How does one include e.g. "-"? In my particular case, I want to rename variables in an R codebase that is littered with "." (it is often used instead of "_" in variable names, see for example Google's R style guide), so I would like to include "." in the definition of a word.

EDIT:

To be extra clear: say I want to change current.my.date <- my.date + today into current.my.date <- any.date + today. What is the sed command?

E.g., fix this command:

echo "current.my.date <- my.date + today" | sed "s/\bmy.date\b/any.date/g"
current.any.date <- any.date + today

In its current form it also changes current.my.date.

Cookie

2 Answers

Try this:

$ echo "current.my.date <- my.date + today" |
    sed -r 's/(^|[^[:alnum:]_.])my\.date([^[:alnum:]_.]|$)/\1any.date\2/g'
current.my.date <- any.date + today

It assumes that a "word" is a sequence of alphanumeric, "_" or "." characters, bounded on each side by a character outside that set, by the start of the string (^), or by the end of the string ($).

If that's not what you want, post more sample input and expected output.
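The pattern above can be wrapped in a small shell function so that the old and new names are passed as arguments. This is only a sketch: `rename_word` is a hypothetical helper name, and it assumes both names consist solely of letters, digits, `_` and `.`, so that only dots need escaping. It uses `sed -E`, which GNU and BSD sed accept in place of `-r`.

```shell
# Hypothetical helper: rename_word OLD NEW < input
# Assumption: OLD and NEW contain only letters, digits, "_" and ".".
rename_word() {
    # Escape dots so they match literally instead of "any character".
    old=$(printf '%s' "$1" | sed 's/\./\\./g')
    sed -E "s/(^|[^[:alnum:]_.])${old}([^[:alnum:]_.]|\$)/\1$2\2/g"
}

echo "current.my.date <- my.date + today" | rename_word my.date any.date
# current.my.date <- any.date + today
```

Because the separator on each side is captured and written back, punctuation such as brackets survives the rename.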

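Sounds like you need some variation of this:

awk '{
    head = ""
    tail = $0
    while ( match(tail, /(^|[^[:alnum:]_.])my\.date([^[:alnum:]_.]|$)/) ) {
        # lead is 1 if the match starts with a separator, 0 if it starts at the word
        lead = (substr(tail, RSTART, 7) == "my.date") ? 0 : 1
        # keep everything up to and including the leading separator
        head = head substr(tail, 1, RSTART - 1 + lead) "any.date"
        # leave the trailing separator in tail so it can start the next match
        tail = substr(tail, RSTART + lead + 7)
    }
    print head tail
}' file

to get what you want.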

Ed Morton
  • +1 for using pure POSIX regex, let's hope this suits the OP. – anubhava Apr 18 '14 at 14:07
  • I have been using this for a while and it works well enough, but as feared there are still cases where it doesn't work. Currently it fails on `Var=Var`, i.e. when there is only one character between two instances to be renamed. It is a shame there is no way to actually replicate the `\b` behaviour and amend it. – Cookie Apr 20 '14 at 12:21
  • Yes, that's true, the separator character will only be "seen" for one of the matches. You need to switch to awk - I posted a snippet as your starting point, it's not exactly what you need but hopefully you get the idea and can modify it to suit. – Ed Morton Apr 20 '14 at 13:55
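For the `Var=Var` case, one way to stay within sed is to repeat the substitution until nothing changes, so that a separator consumed by one match is available again on the next pass. The following is a minimal sketch, not a general solution: `rename_var` and the replacement name `New` are made up for illustration, and it assumes the replacement never contains `Var` as a whole word, since the loop would otherwise not terminate.

```shell
# Hypothetical helper: replaces whole-word "Var" with "New", where a word
# may contain letters, digits, "_" and ".".
# Assumption: "New" does not itself match the pattern, or the loop never ends.
rename_var() {
    sed -E -e ':again' \
           -e 's/(^|[^[:alnum:]_.])Var([^[:alnum:]_.]|$)/\1New\2/g' \
           -e 't again'
}

echo "Var=Var" | rename_var
# New=New
echo "Var=Var+my.Var" | rename_var
# New=New+my.Var
```

The `t again` command branches back to the label only if the preceding `s` command changed something, so the loop stops on the first pass that finds no match.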

so I would like to include "." in the definition of a word

You can use this character class:

[A-Za-z0-9_.]

If you want to add hyphen also then use:

[A-Za-z0-9_.-]

Also remember that with these additions you cannot rely on \b as a word boundary, since \b still treats hyphen and dot as non-word characters. You can use a negated character class for that case:

[^A-Za-z0-9_.-]
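The difference can be seen side by side in a short sketch. It assumes GNU sed (because `\b` is a GNU extension), and the sample line and the placeholder replacement `X` are chosen only for illustration.

```shell
# GNU sed is assumed here, since \b is a GNU extension.
line="current.my.date <- my.date"

# "." is not a word character, so \b also matches between "." and "m".
with_b=$(echo "$line" | sed 's/\bmy\.date\b/X/g')
echo "$with_b"      # current.X <- X

# A negated class that lists "." and "-" treats current.my.date as one word.
with_class=$(echo "$line" | sed -E 's/(^|[^A-Za-z0-9_.-])my\.date([^A-Za-z0-9_.-]|$)/\1X\2/g')
echo "$with_class"  # current.my.date <- X
```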

EDIT:

echo "foo-bar embarassment" | sed "s/\([A-Za-z0-9_.-]\+\)/no \1/g"
no foo-bar no embarassment
anubhava
  • I am sorry, what command do you need? You just asked how to include dot and hyphen in your question. Provide some sample input and I will assist with the sed command as well. – anubhava Apr 18 '14 at 10:02
  • What does this command `echo "bar embarassment" | sed "s/\bbar\b/no bar/g"` become? How do I replace only whole words which can contain dots? – Cookie Apr 18 '14 at 10:04
  • Sorry that doesn't work for me. I edited my question to include an example. – Cookie Apr 18 '14 at 10:21
  • That will be: **`echo "current.my.date <- my.date + today" | sed -r "s/(^| )(my\.date)( |$)/\1any.date\3/g"`** – anubhava Apr 18 '14 at 10:26
  • Might this be closest? `echo "my.date" | sed -r "s/([^A-Za-z0-9_.]|^)(my\.date)([^A-Za-z0-9_.]|$)/\1any.date\3/g"`? But I am wondering whether there isn't a way to see what `\b` actually implements and to modify its behaviour? – Cookie Apr 18 '14 at 10:42
  • You cannot change definition of `\b` you can just use alternatives. – anubhava Apr 18 '14 at 10:42
  • Instead of `\b` you can use `(^| )` as I showed OR else use `([^A-Za-z0-9_.-])` but meaning of `\b` cannot be changed – anubhava Apr 18 '14 at 10:43
  • `(^| )` really doesn't work because it doesn't respect brackets etc. Neither does `([^A-Za-z0-9_.-])` work because it doesn't respect beginning of line. There are just a lot of errors here, and I am scared we are missing some cases, even when using `([^A-Za-z0-9_.]|^)` as I currently am. – Cookie Apr 18 '14 at 10:49
  • You need to clear up your confusion about regex first. A clearly explained problem with good examples usually attracts many more answers on SO. You never mentioned brackets before. To get the right answer in one shot, you first need to edit your question and show **each and every case** that you want to treat as a word boundary. Changing/clarifying requirements via comments is not a good idea. – anubhava Apr 18 '14 at 10:52
  • Okay. I have a very large codebase. There are lots of cases. I can't post them all. I want to rename variables. Usually I use `\b` for that, but it doesn't allow for dots. So I want the same behaviour but including dots. Unfortunately that means I can't list every case in an example. – Cookie Apr 18 '14 at 11:06
  • Clearly DOT is not the only case, since you are also commenting about hyphen or bracket somewhere. **`\b` is the equivalent of `[^A-Za-z0-9_]`** but it doesn't grab the text. You can add your chosen character to this class to extend that definition, making it `([^A-Za-z0-9_.-])` for example. What you need to define in the question is your expected definition of `\b`. – anubhava Apr 18 '14 at 11:17
  • That is not correct. \b does all whole words. It also accommodates beginning of line, end of line, and I don't know what else. It uses above sequence to explain how it defines words, nothing else. – Cookie Apr 18 '14 at 11:26
  • Yes `\b` does match at line start and end also but don't know what you mean by `\b does all whole words` – anubhava Apr 18 '14 at 11:27
  • `[^A-Za-z0-9_]` does not match beginning of line, `\b` does, so they are not equivalent in my eyes. I think my expected behaviour is very clear - it is in the title. I want all *whole* words, including `.` in the definition of a word. – Cookie Apr 18 '14 at 11:53
  • @Cookie - you don't need to show ALL of your possible inputs, but clearly showing just 1 is not working out for you. How about posting, say, 10 of the examples you think are hard to handle along with the associated expected output? Anubhava - +1 for sticking with it :-)! – Ed Morton Apr 18 '14 at 13:22
  • Thanks @EdMorton. That is what I tried to explain to OP. In most situations `(^| )` and `( |$)` would have been good enough if OP wants to add DOT, hyphen, brackets etc. as part of a word. – anubhava Apr 18 '14 at 14:06
  • @anubhava: Sorry this is getting confusing. I do not want to add lots of things. I just want to add dots to the word. I specifically don't want to add brackets. However, above version does treat brackets as breaking the word. So this has nothing to do with it. – Cookie Apr 18 '14 at 16:38
  • @EdMorton: I really am struggling here. I am precisely asking this question because I want a generally applicable way of modifying the `\b` behaviour to allow for different word delimiters - this is not a question that can have a list of exhaustive examples. If you look at the original question referenced, that was exactly how it was asked: "how do I replace whole words". There was no exhaustive list of examples. It got a correct and succinct answer, lots of upvotes and views. Admittedly the subset of ppl applying this to other codebases such as R will be smaller, but nonetheless relevant. – Cookie Apr 18 '14 at 16:40
  • We have even suggested that you provide enough examples to make yourself clear. If you just want to add dot to the word definition, then my very first suggestion, `[A-Za-z0-9_.]\+`, will be enough for you. – anubhava Apr 18 '14 at 17:04
  • @Cookie - you have your answer then, right? There is no way of modifying the meaning of `\b` to include `.` or any other character, and the alternative is to use `s/(^|[^[:alnum:]_.])foo([^[:alnum:]_.]|$)/\1bar\2/` instead of `s/\bfoo\b/bar/`. The original question was a very simple one with a very simple answer. Your question is neither. – Ed Morton Apr 18 '14 at 17:09