1

I just recently migrated from STATA to R. I am quite excited but obviously now face all these teething problems of getting used to R and the way it works.

In my work I have been using STATA for data cleaning as well. And now I intent to do the same with R. We collect lots of primary data (household level data) using questionnaires. The data sets are quite big and often consists of a number of different grids (household grids, for example, where we repeat a number of questions for each household member such as what is your name, your age, your education etc.). So you might end up with questions such as Q3_Name_1 (name for the household member 1) Q3_Age_1 (age of household member 1) Q3_Edu_1 (education of household member 1)

Q3_Name_2 (name for the household member 2)
Q3_Age_2 (age of household member 2)
Q3_Edu_2 (education of household member 2)

Q3_Name_3 (name for the household member 3)
Q3_Age_3 (age of household member 3)
Q3_Edu_3 (education of household member 3)

The structure of the data frame more or less looks as follows:

df <- data.frame(ID=c(1, 2, 3,4), Q3_A_1=c(1, 3, 2, 5), 
                 Q3_Age_2=c(1, 4, 2, "Refused"), 
                 Q3_Age_3=c(1, 9, 2, 4), 
                 Q3_Age_4=c(1,11, "Don't know", 5), stringsAsFactors=F)

If I need to make changes to one question in a household grid (for example to the question what is your age), I most likely need to make similar changes to all other name-questions in the household grid. Here you best use a loop. You write the commands once and then R applies to all these questions in the grid.

I tried and failed. I still don’t get the principles that govern these loops in R. in STATA, the key piece is a placeholder, something like ‘i’ to replace the numbers at the end of each question name. What is the equivalent in R? I tied to crack the nut as follows:

i<-1; while(i<=11){
w3$Q3Age_i<-as.character(w3$Q3Age_i)
   w3$Q3Age_i[w3$Q3Age_i == "Refused" | w3$Q3Age_i == "Don't Know"] <-   "NA"
   w3$Q3Age_i<-as.numeric(w3$Q3Age_i)
   i<-i+1
}

Maybe you can also use ‘repeat’ ot the like. But at this stage, I just don’t understand how you can mae R understand that w3$Q3Age_i first refers to w3$Q3Age_1 and then to w3$Q3Age_2 etc.

Any help or hints would much appreciated!

Best,

Dom

DomB
  • 217
  • 1
  • 4
  • 14
  • Look into using the `apply` functions. – Tim Biegeleisen Jan 09 '16 at 08:40
  • You should give an example of your data (through for instance the output of `dput(w3)`) and state your desired output. – nicola Jan 09 '16 at 08:48
  • I started writing an answer, but I think your data is probably where the problem lies. It would be much easier to solve if your data was in a classic database format, e.g. columns of `Question`, `Household`, `Respondent`, which combined would make a unique ID. I'll put something together based on this. – MikeRSpencer Jan 09 '16 at 09:01
  • Thanks everyone for the first hints. I am still a bit confused but will look into the first suggestions. I edited my first question to include an example of the structure of the data frame. I hope that makes it more clear, cheers Dom – DomB Jan 09 '16 at 18:33
  • You can loop over columns using `apply`, but I still think you're best making your data into standard database long format, e.g. . – MikeRSpencer Jan 09 '16 at 18:42

1 Answers1

1

The example you give doesn't need a loop as R vectorises operations. If we make some dummy data:

df = data.frame(ID=c(1, 2, 3, 4),
                Q3_Age_1=c(1, 3, 2, 5), 
                Q3_Age_2=c(1, 4, 2, "Refused"), 
                Q3_Age_3=c(1, 9, 2, 4), 
                Q3_Age_4=c(1, 11, "Don't know", 5),
                stringsAsFactors=F)

We can then change one column:

df$ans[df$Q3_Age_2 == "Refused"] = NA

Which is not ideal, given the current data structure.

You can use the reshape2 package (http://seananderson.ca/2013/10/19/reshape.html) to transform your data from wide to long:

library(reshape2)
df = melt(df, id.vars="ID")

Which means you can now operate on your answers column more easily.

However, if you want to do more complex things and do need a loop I think it's key to better define your index column(s).

Get index columns:

x = strsplit(df$ind, "_")
x = do.call("rbind.data.frame", x)
colnames(x) = c("Question", "Household", "Respondent")
df = cbind(x, df)

Iterate over your data:

# Note lapply returns a list
lapply(unique(df$Respondent), function(i){
    x = df[df$Respondent, ] = i
    # Do what you need to x, your subset of df
    # Make sure your last line returns your result
})

You may also want to look at group_by from the dplyr package.

Finally, look at how to make a repeatable example (How to make a great R reproducible example?) as I suspect your data structure is pretty odd (and is perhaps best stored in a list)!

Community
  • 1
  • 1
MikeRSpencer
  • 1,276
  • 10
  • 24
  • Thanks Mike. What I failed to point out is that i want to make changes over a set of different variables (different columns in my data frame). These different columns all have the same variable name but a counter at the end. I edited your first example to make it more clear of how my data is structured. I hope that makes it more clear. Would you still proceed as in your second example? Thanks in advance! Dom – DomB Jan 09 '16 at 18:35
  • My suggestion would be to have a look at different data structures, as I think that's where the problem lies. From the information I have I'd have two index columns (household & respondent) and then question columns, given each question is likely to have a different data type. – MikeRSpencer Jan 09 '16 at 18:51
  • Thanks for the hints. Will look into it immediately. BTW, I solved my problem as follows: i <- grep("Q3_Age_", colnames(df)) and then next line df[i][df[i]=="Refused" | df[i]=="Don't know"]<-999! Thanks for all the help – DomB Jan 10 '16 at 09:17