0

I have a dataframe with names I have a second dataframe with a dictionary of names and the sex of those names I want to check if the name is in the dictionary if it is then add the sex from the dictionary table to the names dataframe

My code looks like below

# Sets everything to -1, 1 for male, 0 for female
train$sex <- "-1"
train$sex[toupper(train$fname) == nam_dict$Name]<-nam_dict$Sex

I am getting the following error

Error in train$sex[toupper(train$fname) == nam_dict$Name] <- nam_dict$Sex : 
NAs are not allowed in subscripted assignments
In addition: Warning message:
In toupper(train$fname) == nam_dict$Name :
longer object length is not a multiple of shorter object length

I have a work around - i think - where i can split the dictionary into male and female and simply replace the <-nam_dict$Sex portion of the code with the character 'F' or 'M' depending on the dictionary

I just thought there would be a better way

smci
  • 32,567
  • 20
  • 113
  • 146
John Smith
  • 2,448
  • 7
  • 54
  • 78
  • 1
    First of all its not clear if `train` and `nam_dict` of the same size, but even if they were, you are subsetting `train$sex` by condition and inserting `nam_dict$Sex` into without subsetting it too. Also, the message `NAs are not allowed in subscripted assignments` usually means that your subset operation returned `NA`s and R not sure where it should insert the data to. Either way, this is not reproducible as we don't have both data sets so I doubt anyone can help you really. – David Arenburg May 02 '15 at 19:23
  • Hi David, train and nam_dict are different sizes. The dictionary can be gotten from the link [here](http://www.autohotkey.com/board/topic/20260-gender-verification-by-forename-cmd-line-tool-db/). I took the file nam_dict.txt from it. The train dataset can be gotten from [here](https://www.ssb.no/en/befolkning/statistikker/navn/aar/2015-01-27?fane=tabell&sort=nummer&tabell=215662). Its from the central statistics office in Norway – John Smith May 02 '15 at 19:30
  • 2
    I'm certainly not going to download these data sets (someone maybe will though). I only can help if you will provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). I also think that within the process of making this example you will understand what's wrong with your code. Wish you luck either way. – David Arenburg May 02 '15 at 19:40
  • Thanks @David, that seems fair enough. You mentioned in your first post that i wasnt subsetting my nam_dict$Sex, Could you tell me how to do this. In essence if the first names match eg row 1 column 1 of the nam_dict is equal to the train data, i want to add row 1 column 2 of the nam_dict to the train data. Again thank you for your time. I will have a look at the NAs. It might be because of recycling or the dictionary might not have been cleaned correctly – John Smith May 02 '15 at 19:45
  • Something like `train$sex[toupper(train$fname) == nam_dict$Name] <- nam_dict$Sex[toupper(train$fname) == nam_dict$Name]` probably, but it seems like you additional error there (which I also mentioned). Though this code doesn't make much sense to me if `train` and `nam_dict` are of different sizes. Try just doing `toupper(train$fname) == nam_dict$Name` and see what happens. I think you are doing it all wrong. – David Arenburg May 02 '15 at 19:49
  • Yep, i have found the reason for the NA message, it appears an extra row was inserted because i imported it from excel and column one contained a comma i hadn't noticed. One down. How would do the comparison if my way seems clumsy - something like rbind? Again, genuinely thanks for your help – John Smith May 02 '15 at 20:13
  • Take a look at `match` or maybe `merge` – David Arenburg May 02 '15 at 20:16
  • Ok, nvm, I'll write something for you in a minute just because you are very polite and because I want to teach you how to provide a reproducible example in R – David Arenburg May 02 '15 at 20:23
  • In case you don't know, this is called a ***join*** or ***merge*** operation, like in SQL. Specifically, **LEFT JOIN**. In this case your slight twist is you need to perform a case-insensitive match on the name column, and some names may still no-match anyway (not found in the namelist), in which case the output should keep that name with NA in the sex column. – smci May 02 '15 at 23:14

1 Answers1

1

So your problem can be solved by a quick and simple implementation of match. First, here's a quick reproducible example

(train <- data.frame(fname = c("Alex", "Jennifer", "David", "Alice")))
#      fname
# 1     Alex
# 2 Jennifer
# 3    David
# 4    Alice
(nam_dict <- data.frame(Name = c("alice", "alex"), Sex = 0:1))
#    Name Sex
# 1 alice   0
# 2  alex   1

A possible solution

train$sex <- nam_dict$Sex[match(tolower(train$fname), tolower(nam_dict$Name))]
train
#      fname sex
# 1     Alex   1
# 2 Jennifer  NA
# 3    David  NA
# 4    Alice   0

Al bit more advanced solution (if you''ll ever work with huge data sets) is to try data.tables binary join which allows you to update your data by reference, chose which columns to update while everything is done in a lightning speed.

First we will convert both data sets to data.table class and make both columns lower case, then we will key them by the column we want join by, finally we do a left join to train while creating the sex column by reference and pulling the data from i (Sex column in nam_dict)

library(data.table)
setDT(train)[, fname := tolower(fname)]
setDT(nam_dict)[, Name := tolower(Name)]
setkey(train, fname) ; setkey(nam_dict, Name)
train[nam_dict, sex := i.Sex]
train
#       fname sex
# 1:     alex   1
# 2:    alice   0
# 3:    david  NA
# 4: jennifer  NA
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 1
    Sorry for the delay in getting back to you - Timezones - this is exactly what i was looking for. Thank you for the answer and especially the huge data sets (next step). I have also noted the re-produceable method example and will use this method from now on when asking questions – John Smith May 03 '15 at 08:04