I have the following code:

titanic <- titanic %>%
  mutate(title = ifelse(str_detect(name, "Mr.|Ms.|Mme."), "Hombre casado",
                 ifelse(str_detect(name, "Master."), "Hombre soltero",
                 ifelse(str_detect(name, "Miss."), "Mujer soltera",
                 ifelse(str_detect(name, "Mrs.|Mlle."), "Mujer casada", "Otro")))))

I have the following dataframe:

name <- c("Mr Sergio", "Mrs Maria")
surname <- c("Nnunci", "Gonzalez")

df <- data.frame(name, surname)

The idea is to fill the title column with each person's marital status, depending on whether the name column contains Mr, Ms, or Mrs.

For example, if the "name" column contains one of Mr|Ms|Mme, then title should be "Hombre casado", which means "Married man".

It works well except for "Mrs.", which means "Married woman": when I apply this code to my dataset, married women appear as "Hombre casado" (Married man). I think the problem is in the patterns I am using to detect the titles.

Expected output:

    name       surname    title
    Mr Sergio  Nnunci     Hombre casado
    Mrs Maria  Gonzalez   Mujer casada

Any ideas?

Marcus Campbell
  • You should wrap the patterns in `fixed()`, like `str_detect(name, fixed("Mr.|Ms.|Mme."))`. If you don't do that, the pattern is evaluated as a regular expression in which a `.` means *"every character"* – Jaap Feb 08 '18 at 16:27
  • The period is a wildcard character in regular expressions. Try escaping the `.` with `\\.` – Dave2e Feb 08 '18 at 16:28
  • Two ideas: 1. you could structure your code so that the "Mrs" checks come *before* the "Mr" checks. 2. (better) you could use the regex special `\b` that matches *word boundaries* (white space, punctuation, line ends), making your patterns more like `"Mr\\b|Ms\\b|Mme\\b"`. Also heed Dave's advice: in regex a `.` will match any one character. If you want to literally match a `.` you need to escape it. However, in this case I think you are better off using word boundaries. – Gregor Thomas Feb 08 '18 at 16:28
  • Thanks all of u finally I used what u said about word boundaries!!! Thanks all!!! – Sergio Urrea González Feb 08 '18 at 16:36
  • As a side comment, I don't know which version of the titanic data you are using, but I would assume `Ms.` is equivalent to Miss, `Mlle` stands for Mademoiselle, also equivalent to Miss, and `Mme` stands for Madame, equivalent to Mrs. – Gregor Thomas Feb 08 '18 at 16:38

1 Answer

One way to do this is to extract the word that is immediately followed by a period because, as luck would have it, all of the titles in this dataset end with a period:

library(tidyverse)

titanic <- read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

titanic %>%
    mutate(Title = str_extract(Name, "\\w+(?=\\.)")) %>%
    rowwise() %>%
    mutate(
        Title = switch(Title,
                       Master = "Hombre soltero",
                       Miss = "Mujer soltera",
                       Mr = "Hombre casado",
                       Mrs = "Mujer casada",
                       Mme = "Mujer casada",
                       Ms = "Mujer soltera",
                       Mlle = "Mujer soltera",
                       "Otros")) %>%
    select(Name, Title) %>%
    distinct()
# Otros includes Don, Rev, Dr, Major, Lady, Sir, Col, Capt, Jonkheer
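
If you would rather avoid `rowwise()` and `switch()`, the same idea can be sketched with a named lookup vector. This is only a minimal base R sketch, not the code above: `names_sample`, `lookup`, `title_raw` and `title` are names I made up for illustration, the five names are a small hand-typed sample in the dataset's "Surname, Title. Given names" format, and `regexpr(..., perl = TRUE)` stands in for `str_extract()` so that the block runs without any packages.

```r
# Small hand-typed sample (assumption: names follow "Surname, Title. Given names")
names_sample <- c(
  "Braund, Mr. Owen Harris",
  "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
  "Heikkinen, Miss. Laina",
  "Palsson, Master. Gosta Leonard",
  "Uruchurtu, Don. Manuel E"
)

# Extract the word immediately followed by a period (same regex as above)
m <- regexpr("\\w+(?=\\.)", names_sample, perl = TRUE)
found <- as.integer(m) != -1L
title_raw <- rep(NA_character_, length(names_sample))
title_raw[found] <- regmatches(names_sample, m)

# Named lookup vector: the names are the extracted titles, the values the labels
lookup <- c(
  Master = "Hombre soltero",
  Miss   = "Mujer soltera",
  Mr     = "Hombre casado",
  Mrs    = "Mujer casada",
  Mme    = "Mujer casada",
  Ms     = "Mujer soltera",
  Mlle   = "Mujer soltera"
)

# Indexing by name is exact, so "Mr" can never pick up the "Mrs" entry;
# titles missing from the lookup come back as NA and fall into "Otros"
title <- unname(lookup[title_raw])
title[is.na(title)] <- "Otros"

print(data.frame(title_raw, title))
```

Because the lookup is vectorised, it handles the whole column in one step, and adding a new title only means adding one entry to `lookup`.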

For a more general solution to the question of 'What do I do if I want to match "Mr" without matching "Mrs"?', you can use a negative lookahead:

str <- "Futrelle, Mrs. Jacques Heath (Lily May Peel)"

# search for "Mr", but not "Mrs", by specifying that the "Mr" cannot be followed by an "s"
str_extract(str, "Mr(?!s)")
# [1] NA
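
To see why the original pattern misfires and how the two fixes compare, here is a small sketch using the names from the question plus two variants with a trailing period, which I added for illustration. It uses base R `grepl(..., perl = TRUE)` instead of `str_detect()` so it runs without any packages. The patterns themselves should behave the same way in stringr, but treat that as an assumption and check it on your own data.

```r
# First two names are from the question; the last two are added for illustration
name <- c("Mr Sergio", "Mrs Maria", "Mr. Owen", "Mrs. Lily")

# Unescaped "." matches any character, so "Mr." also matches the "Mrs" in every Mrs name
dot_wildcard <- grepl("Mr.", name, perl = TRUE)
# TRUE TRUE TRUE TRUE

# Word boundaries: "Mr" must be a whole word, with or without a trailing period
mr_boundary  <- grepl("\\bMr\\b", name, perl = TRUE)
# TRUE FALSE TRUE FALSE
mrs_boundary <- grepl("\\bMrs\\b", name, perl = TRUE)
# FALSE TRUE FALSE TRUE

# Negative lookahead: "Mr" that is not immediately followed by "s"
mr_lookahead <- grepl("Mr(?!s)", name, perl = TRUE)
# TRUE FALSE TRUE FALSE

print(data.frame(name, dot_wildcard, mr_boundary, mrs_boundary, mr_lookahead))
```

On this sample the word-boundary and lookahead patterns give the same result. The word-boundary version is a little stricter, since it also requires that "Mr" is not preceded by another letter.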
Mark