I'm an awk newbie, so please bear with me.

The goal is to change the case of a string such that the first letter of every word is uppercase and the remaining letters are lowercase. (To keep the example simple, "word" is defined here as strictly alphabetic characters; all others are considered separators.)

I learned a nice way to make the first letter of every word uppercase from another post on this website using the following awk command:

echo 'abcd efgh ijkl mnop' | awk '{for (i=1;i <= NF;i++) {sub(".",substr(toupper($i),1,1),$i)} print}' --> Abcd Efgh Ijkl Mnop

Making the remaining letters lowercase is easily accomplished by preceding the awk command with a tr command:

echo 'aBcD EfGh ijkl MNOP' | tr 'A-Z' 'a-z' | awk '{for (i=1;i <= NF;i++) {sub(".",substr(toupper($i),1,1),$i)} print}' --> Abcd Efgh Ijkl Mnop

However, in the interest of learning more about awk, I wanted to change the case of all but the first letter to lowercase with a similar awk construct. I used the regular expression \B[A-Za-z]+ to match all letters of a word but the first, and the awk command substr(tolower($i),2) to provide those same letters in lowercase, as follows:

echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1;i <= NF;i++) {sub("\B[A-Za-z]+",substr(tolower($i),2),$i)} print}' --> Abcd EFGH IJKL MNOP

Notice that the first word converted properly, but the remaining words are left unchanged. I would be very grateful for an explanation of why the remaining words did not convert properly and how to get them to do so.

scolfax
  • you can find the solution [here](http://theunixshell.blogspot.com/2013/01/caplitalizechange-to-uppercase-first.html) – Vijay Jan 03 '13 at 13:40
  • Although I was trying to solve the problem with `awk`, thank you for the link to a nice `perl` solution. – scolfax Jan 03 '13 at 14:20

4 Answers

The issue is that \B (zero-width non-word boundary) only seems to match at the beginning of the line, so $1 works but $2 and the following fields do not match the regex; they are not substituted and remain uppercase. I am not sure why \B fails to match beyond the first field, since \B should match anywhere within any word:

echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1; i<=NF; ++i) { print match($i, /\B/); }}'
2   # \B matches ABCD at 2nd character as expected
0   # no match for EFGH
0   # no match for IJKL
0   # no match for MNOP

Anyway, to achieve your result (capitalize only the first character of the line), you can operate on $0 (the whole line) instead of using a for loop:

echo 'ABCD EFGH IJKL MNOP' | awk '{print toupper(substr($0,1,1)) tolower(substr($0,2)) }'

Or if you still wanted to capitalize each word separately but with awk only:

awk '{for (i=1; i<=NF; ++i) { $i=toupper(substr($i,1,1)) tolower(substr($i,2)); } print }'
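One thing to keep in mind with the per-field version: assigning to $i makes awk rebuild $0 with the output field separator, so runs of whitespace collapse to single spaces, and fields are split on whitespace rather than on non-letters. A minimal sketch that instead follows the question's definition of a word (a run of alphabetic characters, everything else a separator) could use match() with RSTART and RLENGTH, which are standard awk. The function name capitalize_words is just an illustrative choice, not something from the original posts.

```shell
# Hypothetical helper: capitalize each run of letters and copy
# every other character through unchanged.
capitalize_words() {
    printf '%s\n' "$1" | awk '{
        rest = $0
        out = ""
        # match() sets RSTART and RLENGTH for the leftmost run of letters
        while (match(rest, /[A-Za-z]+/)) {
            word = substr(rest, RSTART, RLENGTH)
            out = out substr(rest, 1, RSTART - 1)                       # separators before the word
            out = out toupper(substr(word, 1, 1)) tolower(substr(word, 2))
            rest = substr(rest, RSTART + RLENGTH)                       # continue after the word
        }
        print out rest                                                  # trailing separators, if any
    }'
}

capitalize_words 'aBcD  EfGh-ijkl   MNOP'
```

Since this avoids \B entirely, it should not depend on which awk implementation is installed.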
Anders Johansson
  • Thank you for showing that `\B` unfortunately only operates at the beginning of the line. With your suggested construct without the `for` loop, the first word changed case as desired but the remainder became lowercase: Abcd efgh ijkl mnop – scolfax Jan 03 '13 at 14:01
  • @scolfax, yes at first I interpreted your request "I wanted to change the case of all but the first letter to lowercase" as all letters of the *line*. But I also provided a second construct below, in case you wanted to uppercase the first letter of each *word* instead. – Anders Johansson Jan 03 '13 at 14:05
  • That did it! Simpler than using the `sub` command. Thank you very much. I've been a `grep` and `sed` person for some time, but it's finally time to dive into `awk`! – scolfax Jan 03 '13 at 14:18

When matching regex using the sub() function or others (like gsub() etc), it's best used in the following form:

sub(/regex/, replacement, target)

This is different from what you have:

sub("regex", replacement, target)

So your command becomes:

awk '{ for (i=1;i<=NF;i++) sub(/\B\w+/, substr(tolower($i),2), $i) }1'

Results:

Abcd Efgh Ijkl Mnop

This article on String Functions may be worth a read. HTH.
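A small sketch of why the literal form tends to be easier to get right: a string constant goes through awk's string-escape processing before it is compiled as a regex, so a backslash intended for the regex engine has to be doubled. The example below uses an escaped dot rather than \B, because \B is a GNU extension whose support varies between awk implementations. The function name escape_demo is hypothetical.

```shell
# Hypothetical helper comparing a regex literal with an equivalent string regex.
escape_demo() {
    printf '%s\n' "$1" | awk '{
        a = $0; b = $0
        sub(/\./, "-", a)     # regex literal: one backslash reaches the regex engine
        sub("\\.", "-", b)    # string: "\\." is first reduced to \. and then compiled
        print a, b
    }'
}

escape_demo 'x.y.z'
```

Both substitutions replace only the first literal dot, so the two printed values should be identical.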


I should say that there are easier ways to accomplish what you want, for example using GNU sed:

sed -r 's/\B\w+/\L&/g'
Steve
  • The link you provided describes nicely why regex literals /.../ are much preferred over strings "...", and I will make that change henceforth. However, for some reason, I got the same result as before, with only the first word converting case: `echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1;i <= NF;i++) sub(/\B[A-Za-z]+/, substr(tolower($i),2),$i)}1'` --> Abcd EFGH IJKL MNOP. I wonder if this is due to an `awk` version difference (mine is OS X, which by the way doesn't support `\L` in the `sed` command, unfortunately). – scolfax Jan 03 '13 at 14:27

My solution would be to build the first argument of sub with substr instead of your regex:

echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1 ; i <= NF ; i++) {sub(substr($i,2),tolower(substr($i,2)),$i)} print }'
Abcd Efgh Ijkl Mnop
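As the comments below point out, sub() treats its first argument as a regex and replaces the leftmost match, which is not necessarily the tail of the word. A rough sketch comparing that behaviour with plain concatenation (no regex involved) follows; the function name pitfall_demo and the variable names are illustrative only.

```shell
# Hypothetical helper: for each word, print the result of the sub()-based
# approach next to the result of simple concatenation.
pitfall_demo() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i <= NF; i++) {
            rest = substr($i, 2)
            viaSub = $i
            sub(rest, tolower(rest), viaSub)               # leftmost match of rest, used as a regex
            viaConcat = substr($i, 1, 1) tolower(rest)     # first character kept, tail lowercased
            print viaSub, viaConcat
        }
    }'
}

pitfall_demo 'AAAA ABCD'
```

For AAAA the regex AAA matches starting at the first character, so the sub() column shows aaaA while the concatenation column shows Aaaa; for ABCD both columns agree.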
Pilou
  • Using `substr($i,1)` itself as the regex is very creative. At first I thought I could run into problems if that sequence of characters was repeated later in the same word, but I believe `sub` matches only the first occurrence, so your solution should work fine. Thank you for this nice solution. – scolfax Jan 03 '13 at 14:41
  • Sorry, I meant `substr($i,2)`. And cancel my comment about a second occurrence, since ($i,2) consists of the entire word from the second character on. – scolfax Jan 03 '13 at 14:46
  • Actually, this construct does run into a problem if the 2nd-through-last characters match the 1st-through-(last - 1) characters: `echo 'AAAA' | awk '{for (i=1 ; i <= NF ; i++) {sub(substr($i,2),tolower(substr($i,2)),$i)} print }'` --> aaaA – scolfax Jan 03 '13 at 16:10
  • Indeed, I had thought about repeated sequences but not about the problem you pointed out. I'll probably look for another or improved solution. But you have already chosen an answer, so probably not immediately... – Pilou Jan 03 '13 at 16:54
  • You can apply sub to the whole of $i to be safe, but that ends up equivalent to the accepted solution: `echo 'ABCD EFGH IJKL MNOP AAAA' | awk '{for (i=1 ; i <= NF ; i++) {sub($i,substr($i,1,1)tolower(substr($i,2)),$i)} print }'` --> Abcd Efgh Ijkl Mnop Aaaa – Pilou Jan 03 '13 at 16:59
  • Thank you. That corrects the problem. (Being somewhat inexperienced, I chose an answer too quickly and learned from this to wait until everyone has had a chance to share their thoughts.) – scolfax Jan 03 '13 at 18:19

You have to add another \ character before \B

 echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1;i <= NF;i++)
 {sub("\\B[A-Za-z]+",substr(tolower($i),2),$i)} print}'

With just \B awk gave me this warning:

awk: cmd. line:1: warning: escape sequence `\B' treated as plain `B'

coelhudo
  • I understand the use of the second \ character in the regex string: to make sure it is interpreted as `\B` rather than an escaped `B`. However, for some reason, my version of `awk` (OS X) seemed to interpret it as `\B` anyway, and adding the second \ character didn't change the result. But thanks for that good reminder. – scolfax Jan 03 '13 at 14:12