Using a string distance technique to create a factor variable in R

Question

I am new to R and working on expanding my knowledge. I am reading An Introduction to Data Cleaning with R by Edwin de Jonge and Mark van der Loo, and I am working on exercise 2.4. I would appreciate it if someone could confirm my approach to this specific problem. This is the original data:

    // Survey data. Created : 21 May 2013
    // Field 1: Gender
    // Field 2: Age (in years)
    // Field 3: Weight (in kg)
    M;28;81.3
    male;45;
    Female;17;57,2
    fem.;64;62.8

This is a cleaner version that I was able to construct:

df:

      Gender Age..in.years. Weight..in.kg.
    1      M             28           81.3
    2   male             45           <NA>
    3 Female             17           57,2
    4   fem.             64           62.8

This is what I get from recoding with adist:

D:

      rawtext  coded
    1       M   male
    2    male   male
    3  Female female
    4    fem. female
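
The question does not show the recoding code itself. A minimal sketch of how a table like D could be produced with approximate string matching follows; the names `rawtext`, `codes`, `d` and `i` are assumptions for illustration, and `adist` comes from the utils package that ships with R.

```r
# Raw Gender values as they appear in the survey data
rawtext <- c("M", "male", "Female", "fem.")

# Target codes (assumed; the question only shows the recoded result)
codes <- c("male", "female")

# Edit-distance matrix: one row per raw value, one column per code.
# adist() is case-sensitive by default.
d <- adist(rawtext, codes)

# For each raw value, take the index of the code with the smallest distance
i <- apply(d, 1, which.min)

# Store the result as a factor whose levels follow the order of `codes`
D <- data.frame(rawtext = rawtext, coded = factor(codes[i], levels = codes))
D
```

Under these assumptions the printed D should match the table above.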

Now I have to transform the Gender column into a factor variable with labels man and woman. I am not sure how to proceed; I am thinking of replacing the Gender column of the data with the following vector:

    f <- factor(D$coded, levels = c("male", "female"), labels = c("man", "woman"))

which returns:

    [1] man   man   woman woman
    Levels: man woman

Am I correct or plain wrong? Is there a way to use transform to change the Gender variable in df directly? Or is it better to do:

    df$Gender <- plyr::revalue(D$coded, c(male = "man", female = "woman"))

Or is there another way to change the observations of the Gender variable to "man" or "woman" without using multiple ifelse calls?

I am trying to get answers by learning more about factors but nothing quite similar to this pops up anywhere. Thanks.

You should display data so that it is easily reproducible. Here's a reference: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Frank, May 01 '15 at 18:19
Could you display the results of `D$coded`? The level ordering makes a difference there — David Robinson, May 01 '15 at 18:19
Sorry, I am not much of an expert on Stack Overflow either. The D data is the 3rd code block. — Buckeye14Guy, May 01 '15 at 18:22
@Buckeye14Guy Look at Frank's link, it has *lots* of good advice for asking good questions. One of the best things you can do is share your data using `dput`. — Gregor Thomas, May 01 '15 at 18:23
I am not clear what question you are actually asking here? Maybe you can highlight the part you are having trouble with and introduce or end the question with what exactly it is that is not working for you? — Kmeixner, May 01 '15 at 18:27

score 1 · Accepted Answer · answered May 01 '15 at 18:24

The line

    f <- factor(D$coded, levels = c("male", "female"), labels = c("man", "woman"))

did work, and not just by luck: factor() converts D$coded to character and matches each value against levels, so "male" is paired with "man" and "female" with "woman" whatever order D$coded's own levels are in. The labels would be transposed only if you relabelled by position, for example with levels(f) <- c("man", "woman"), which relies on the existing level order.
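
How level order interacts with relabelling can be checked directly in base R. In the sketch below, `x` is a made-up stand-in for D$coded whose levels are deliberately left in the default alphabetical order (female before male); `by_name` and `by_position` are illustrative names. As far as base R's behaviour goes, factor() matches values against `levels` by name, while `levels(x) <- c(...)` assigns labels by position, and the positional form is the one that can transpose labels.

```r
# Hypothetical stand-in for D$coded; default levels are alphabetical
x <- factor(c("male", "male", "female", "female"))
levels(x)
# [1] "female" "male"

# Explicit levels/labels: each value is matched against `levels` by name
by_name <- factor(x, levels = c("male", "female"), labels = c("man", "woman"))
as.character(by_name)
# [1] "man"   "man"   "woman" "woman"

# Positional relabelling: the first existing level ("female") becomes "man"
by_position <- x
levels(by_position) <- c("man", "woman")
as.character(by_position)
# [1] "woman" "woman" "man"   "man"
```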

When revaluing levels of a factor, it's safer and simpler to use the revalue function from the plyr package:

    f <- plyr::revalue(D$coded, c(male = "man", female = "woman"))

David Robinson

Great, that does work! Thanks. And I did run into a problem where the vector was and that is why I introduced the levels argument in f to keep that order. Would there be a way to use transform(gender = factor(df, levels = ..., labels = ...)) where df is the data in the second code block? – Buckeye14Guy May 01 '15 at 18:28
@Buckeye14Guy Second code block, no, you need to use `adist` to get from the second code block to the third. But you said you already did so – David Robinson May 01 '15 at 19:04
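
On the transform() question, a sketch of doing the adist recoding and the relabelling in one step. The data frame is rebuilt by hand from the second code block, and `codes` is an assumed helper vector, so treat this as an illustration and not as the method from the exercise.

```r
# The cleaned data frame from the question, rebuilt by hand
df <- data.frame(
  Gender = c("M", "male", "Female", "fem."),
  Age..in.years. = c(28, 45, 17, 64),
  Weight..in.kg. = c("81.3", NA, "57,2", "62.8"),
  stringsAsFactors = FALSE
)

# Target codes (assumed)
codes <- c("male", "female")

# Inside transform(), Gender refers to the column of df:
# pick the closest code for each raw value, then relabel by name
df <- transform(
  df,
  Gender = factor(
    codes[apply(adist(Gender, codes), 1, which.min)],
    levels = codes,
    labels = c("man", "woman")
  )
)
df$Gender
```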

score -1 · Answer 2 · edited Sep 27 '16 at 18:47

Using base R, where f is the factor whose levels are still "male" and "female" (for example, D$coded):

    levels(f) <- list(man = "male", woman = "female")
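
A short check of the list form, assuming `f` starts out as a factor with the old codes as its levels (the vector below is made up to mirror D$coded). Each list element reads as new label = old level, so the mapping is spelled out by name and does not depend on the existing level order.

```r
# Hypothetical starting factor; default levels are alphabetical: female, male
f <- factor(c("male", "male", "female", "female"))

# Named list: new label on the left, old level(s) on the right
levels(f) <- list(man = "male", woman = "female")

f
# [1] man   man   woman woman
# Levels: man woman
```

The new level order follows the order of the list names.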

edited Sep 27 '16 at 18:47

Karolis Koncevičius

answered Sep 27 '16 at 12:00

Anthony Simon Mielniczuk
