I have the following data:

                                                 Name
1                             Braund, Mr. Owen Harris
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
3                              Heikkinen, Miss. Laina
4        Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                            Allen, Mr. William Henry

The data can be loaded like this:

data <- structure(list(Name = c("Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)", 
"Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath (Lily May Peel)", 
"Allen, Mr. William Henry")), .Names = "Name", row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))

My expected output is:

                                                 Name    Title
1                             Braund, Mr. Owen Harris       Mr
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)      Mrs
3                              Heikkinen, Miss. Laina     Miss
4        Futrelle, Mrs. Jacques Heath (Lily May Peel)      Mrs
5                            Allen, Mr. William Henry       Mr

The problem is that the code below sets every Title to just "Mr". I'm using a custom function with dplyr's mutate.

library('stringr')
library('dplyr')

extractTitle <- function(name) {
  str_match(name, '(\\b[a-zA-Z]+)\\.')[2]
}

data <- data %>% 
          mutate(Title = extractTitle(Name))

The weird thing is that if I change extractTitle to return its argument unchanged, it works as expected. For example:

extractTitle <- function(name) {
  name
}

data <- data %>% 
          mutate(Title = extractTitle(Name))

The above code will return:

                                                 Name    Title
1                             Braund, Mr. Owen Harris   Braund, Mr. Owen Harris
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)   Cumings, Mrs. John Bradley (Florence Briggs Thayer)
3                              Heikkinen, Miss. Laina   Heikkinen, Miss. Laina
4        Futrelle, Mrs. Jacques Heath (Lily May Peel)   Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                            Allen, Mr. William Henry   Allen, Mr. William Henry

This is the behavior I expect, and it differs from the behavior of the code I'm having trouble with.

Is there something I'm missing here or is this a bug?

P.S. - I'm using dplyr version 0.5.0

Gjaldon
  • What is the input? What is the expected output? It is mostly better to include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) – Jaap Jul 16 '16 at 11:27
  • @ProcrastinatusMaximus I just added the input and expected output as you suggested. This is my first time posting a question for R here in SO. – Gjaldon Jul 16 '16 at 12:10

1 Answer

library(dplyr)
library(stringr)

data %>%
  mutate(title = str_extract(string = Name, pattern = "(Mr|Miss|Mrs)\\.")) %>%
  select(Name, title)

which returns:

# A tibble: 5 x 2
                                                 Name title
                                                <chr> <chr>
1                             Braund, Mr. Owen Harris   Mr.
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)  Mrs.
3                              Heikkinen, Miss. Laina Miss.
4        Futrelle, Mrs. Jacques Heath (Lily May Peel)  Mrs.
5                            Allen, Mr. William Henry   Mr.
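
Note that the pattern above keeps the trailing dot ("Mr."), while the expected output in the question has none ("Mr"). A minimal sketch of one way to drop it, assuming the title is always the first run of letters that is immediately followed by a dot: use a lookahead so the dot is required but not captured. The sample data frame here is rebuilt from the question, and the character class is a guess at what counts as a title, not a vetted pattern.

```r
library(dplyr)
library(stringr)

# Sample data rebuilt from the question (five names).
data <- data.frame(
  Name = c(
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry"
  ),
  stringsAsFactors = FALSE
)

# Lookahead (?=\\.): match a run of letters only when a literal dot
# follows, without including the dot in the match.
result <- data %>%
  mutate(Title = str_extract(Name, "[A-Za-z]+(?=\\.)"))

print(result$Title)
```

Unlike the fixed alternation `(Mr|Miss|Mrs)`, this does not need a list of known titles, but it would also pick up any other dotted abbreviation that happens to come first in a name.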
Maiasaura
  • Thanks a lot! Using `str_extract` fixes it! Looks like the problem was with using the `str_match` function. Any idea why? – Gjaldon Jul 16 '16 at 12:49
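
On the "why" in the last comment, a likely explanation, sketched here rather than confirmed by the original posters: `str_match` returns a character matrix with one row per input string, where column 1 is the full match and column 2 is the first capture group. Indexing that matrix with `[2]` picks a single element, so the function returns one string and `mutate` recycles it to every row. Indexing with `[, 2]` returns the whole capture-group column instead. The vector `name_vec` below is a hypothetical stand-in for the `Name` column.

```r
library(stringr)

# Hypothetical stand-in for the Name column.
name_vec <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina"
)

m <- str_match(name_vec, "(\\b[a-zA-Z]+)\\.")

dim(m)         # one row per name, two columns (full match, capture group)
length(m[2])   # a single element: this is what the original function returned
m[, 2]         # the capture-group column: one title per name

# Vectorised version of the question's function: take the whole column.
extractTitle <- function(name) {
  str_match(name, "(\\b[a-zA-Z]+)\\.")[, 2]
}

extractTitle(name_vec)
```

The identity version of `extractTitle` in the question worked for the same reason this one does: it returns a vector as long as its input, which is what `mutate` expects.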