3

How can I capitalize the first letter of each word, except for certain words?

x <- c('I like the pizza', 'The water in the pool')

I expect the output to be

c('I Like the Pizza', 'The Water in the Pool')

Currently I am using

gsub('(^|[[:space:]])([[:alpha:]])', '\\1\\U\\2', x, perl=T) 

This capitalizes the first letter of every word.

Tushar
imsc

3 Answers

3

You can apply a blacklisting approach with a PCRE RegEx:

(?<!^)\b(?:the|an?|[io]n|at|with|from)\b(*SKIP)(*FAIL)|\b(\pL)

This is a demo of what this regex matches.

In R:

x <- c('I like the pizza', 'The water in the pool', 'the water in the pool')
gsub("(?<!^)\\b(?:the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|of)\\b(*SKIP)(*FAIL)|\\b(\\pL)", "\\U\\1", x, perl = TRUE)
## => [1] "I Like the Pizza"      "The Water in the Pool" "The Water in the Pool"

See IDEONE demo

The article "Words Which Should Not Be Capitalized in a Title" has some hints on which words to include in the first alternative group.

The RegEx explanation:

  • (?<!^) - only try the following alternatives when not at the start of the string (I added this restriction because, as noted in the comments, there is a requirement that the first letter should always be capitalized.)
  • \b - a leading word boundary
  • (?:the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|of) - the list of function words to leave uncapitalized (CAN AND SHOULD BE EXTENDED!)
  • \b - trailing word boundary
  • (*SKIP)(*FAIL) - fail the match once the function word is matched
  • | - or...
  • \b(\pL) - Capture group 1 matching a letter that is a starting letter in the word.
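Since the word list is meant to be extended, it may be easier to build the pattern from a character vector. The following is only a sketch: `title_case` and its `skip` argument are hypothetical names, not part of the answer above, and the words in `skip` are assumed to contain no regex metacharacters.

```r
# Hypothetical helper: builds the (*SKIP)(*FAIL) pattern from a word vector.
# Assumption: the entries of `skip` are plain words without regex metacharacters.
title_case <- function(x, skip = c("the", "a", "an", "in", "on", "at", "of")) {
  alternatives <- paste(skip, collapse = "|")
  pattern <- paste0("(?<!^)\\b(?:", alternatives, ")\\b(*SKIP)(*FAIL)|\\b(\\pL)")
  # \\U\\1 upper-cases the captured first letter of every word that was not skipped
  gsub(pattern, "\\U\\1", x, perl = TRUE)
}

result <- title_case(c("I like the pizza", "the water in the pool"))
result
## => [1] "I Like the Pizza"      "The Water in the Pool"
```

Passing a different `skip` vector changes which words stay lowercase; the first word is still capitalized because of the `(?<!^)` lookbehind.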
Wiktor Stribiżew
  • Do you have some (good) literature on `(*SKIP)(*FAIL)` at hand? – Jan Jan 14 '16 at 09:20
  • @Jan: You can read on SO: [*How do (*SKIP) or (*F) work on regex?*](http://stackoverflow.com/questions/24534782/how-do-skip-or-f-work-on-regex). Also, there is a *`Variation for Perl, PCRE and Python: (*SKIP)(*FAIL)`* section at [*The Greatest Regex Trick Ever* rexegg.com page](http://www.rexegg.com/regex-best-trick.html). – Wiktor Stribiżew Jan 14 '16 at 09:24
  • Thanks. But doesn't work for `the water in the pool`. The first letter should always be capitalized. – imsc Jan 15 '16 at 11:08
  • Then you need a `(?<!^)` at the beginning. – Wiktor Stribiżew Jan 15 '16 at 11:13
  • Thanks for your code. I used it on French and German addresses and ran into a problem: special characters such as `ä,ö,ü,è,é,à` are interpreted as the beginning of words and hence are capitalized in the middle of words. How can I prevent that behaviour and still have the first letter always capitalized? – Oscar Thees Oct 20 '21 at 18:16
  • @OscarThees Add `(*UCP)` at the start of the pattern to make `\b` and other shorthands Unicode-aware. – Wiktor Stribiżew Oct 20 '21 at 19:01
  • @WiktorStribiżew Thanks a lot! – Oscar Thees Oct 20 '21 at 19:14
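The `(*UCP)` suggestion from the comments could look roughly like this. This is a sketch under assumptions: the sample strings and the short exclusion list (`la|die|in`) are made up for illustration, and the accented letters are written as `\u` escapes so that the strings are UTF-8 regardless of how the script file is encoded.

```r
# Illustrative input (not from the thread); \u escapes keep the strings UTF-8.
x <- c("h\u00e9l\u00e8ne aime la pizza", "die stra\u00dfe in z\u00fcrich")

# (*UCP) must come first in the pattern; it makes \b treat accented
# letters as word characters, so no boundary is found in mid-word.
pattern <- "(*UCP)(?<!^)\\b(?:la|die|in)\\b(*SKIP)(*FAIL)|\\b(\\pL)"

result <- gsub(pattern, "\\U\\1", x, perl = TRUE)
result
```

With `(*UCP)` in place, only genuine word-initial letters should be matched, so `é`, `è`, `ß` and `ü` inside a word are left alone.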
2

The following regex achieves what you are trying to do:

\b(?!(?:in|the|of)\b)([a-z])
# look for a word boundary on the left
# ensure that in/the/of does not follow immediately
# (including a trailing word boundary, thanks to @stribizhev)
# match and capture a lowercase letter

The matched letters (in group 1) then need to be converted to uppercase. See a working demo on regex101.

In R:

gsub("\\b(?!(?:in|the|of)\\b)([a-z])", "\\U\\1", x, perl = TRUE)
## [1] "I Like the Pizza"      "The Water in the Pool"
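Note that this pattern leaves an excluded word lowercase even when it starts the string. If the first letter should always be capitalized (as requested in the comments on the other answer), one possible variation, offered here as a sketch rather than as part of the original answer, is to add a `^` alternative in front of the lookahead:

```r
x <- c("I like the pizza", "the water in the pool")

# Either we are at the start of the string, or at a word boundary
# that is not followed by one of the excluded words.
pattern <- "(?:^|\\b(?!(?:in|the|of)\\b))([a-z])"

result <- gsub(pattern, "\\U\\1", x, perl = TRUE)
result
## [1] "I Like the Pizza"      "The Water in the Pool"
```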
G. Grothendieck
Jan
1

I am not good with regex, so I found an alternative. d is a vector of words that should be excluded.

We split each string into words using strsplit and check whether each word matches an entry in d; if it doesn't, we capitalize it using the capitalize function from the Hmisc package.

library(Hmisc)
x <- c('I like the pizza', 'The water in the pool')
d <- c("the","of","in")
lapply(strsplit(x, " "), function(x) ifelse(is.na(match(x, d)), capitalize(x),x))

# [[1]]
#[1] "I"     "Like"  "the"   "Pizza"

#[[2]]
#[1] "The"   "Water" "in"    "the"   "Pool" 

You can then use sapply along with paste to get the result back as a character vector:

a <- lapply(strsplit(x, " "), function(x) ifelse(is.na(match(x, d)), capitalize(x),x))
sapply(a, function(x) paste(x, collapse = ' '))

#[1] "I Like the Pizza"      "The Water in the Pool"
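If you would rather not depend on Hmisc, the same split-and-rejoin idea can be sketched in base R. The names `cap_first` and `title_case_split` are hypothetical, and the choice to always capitalize the first word of each string is an assumption taken from the comments on the regex answer, not something the code above does.

```r
# Hypothetical base-R helper: upper-case the first character of each word.
cap_first <- function(w) paste0(toupper(substring(w, 1, 1)), substring(w, 2))

# Split on single spaces, capitalize everything not in `skip`, and rejoin.
title_case_split <- function(x, skip = c("the", "of", "in")) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    if (length(words) == 0) return("")   # empty input string
    keep_lower <- words %in% skip
    keep_lower[1] <- FALSE               # assumption: first word is always capitalized
    words[!keep_lower] <- cap_first(words[!keep_lower])
    paste(words, collapse = " ")
  }, character(1))
}

result <- title_case_split(c("I like the pizza", "the water in the pool"))
result
#[1] "I Like the Pizza"      "The Water in the Pool"
```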
Ronak Shah