
I have a data frame DF which contains numerous variables. Each variable is present twice because I am conducting an analysis of "couples".

Among other things, DF has a series of diversity indicators:

 DF$div1.1, DF$div2.1, ..., DF$divN.1, DF$div1.2, ..., DF$divN.2

Similarly, it has a series of indicators of another characteristic:

 DF$char1.1, DF$char2.1, ..., DF$charM.1, DF$char1.2, ..., DF$charM.2

Here's a link to an example of DF: http://shorttext.com/5d90dd64

In each case, the suffixes ".1" and ".2" indicate which member of the couple the variable refers to.

My goal: for each pair of indicators divI and charJ, I want to create a new variable DF$divchar that takes the value DF$divI.1 when DF$charJ.1 > DF$charJ.2, and DF$divI.2 when DF$charJ.1 < DF$charJ.2.
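
To make the goal concrete, here is a minimal sketch for a single pair (div1, char1). The toy values and the column name div1char1 are made up for illustration, and ties (char1.1 == char1.2) are set to NA only because the rule above does not say what should happen to them.

```r
# Toy data, made up for illustration only
DF <- data.frame(
  div1.1  = c(10, 20, 30),
  div1.2  = c(11, 21, 31),
  char1.1 = c(5, 1, 3),
  char1.2 = c(2, 4, 3)
)

# Take div1.1 where member 1 has the larger char1, div1.2 where member 2 does.
# Ties are not covered by the rule, so they become NA here (an assumption).
DF$div1char1 <- ifelse(DF$char1.1 > DF$char1.2, DF$div1.1,
                ifelse(DF$char1.1 < DF$char1.2, DF$div1.2, NA))

DF$div1char1
```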

Here is the solution I came up with. It seems very intricate and sometimes behaves in strange ways:

  1. I created a series of binary variables that take the value one if DF$charJ.1>DF$charJ.2. They are stored under DF$CharMax.1. Here's how I created them:

    DF$CharMax.1 <- as.data.frame(
        sapply(seq_along(nam), function(n)
            as.numeric(DF[names(DF) == names.1[n]] > DF[names(DF) == names.2[n]])))
    
  2. I created the function BinaryExtract:

    BinaryExtract <- function(var1, var2, extract) {var1*extract +var2*(1-extract)}
    
  3. I created the matrix NameFull, which contains all the possible combinations of div and char, separated with "YY":

    NameFull <- sapply(c("div1",...,"divN"),
        function(nam) paste(nam, names(DF$CharMax.1), sep="YY"))
    
  4. And then I create all my variables:

    DF[, as.vector(NameFull)] <- lapply(as.vector(NameFull), function(e)
        BinaryExtract(DF[, paste0(unlist(strsplit(e, "YY"))[1], ".1")],
                      DF[, paste0(unlist(strsplit(e, "YY"))[1], ".2")],
                      DF$CharMax.1[unlist(strsplit(e, "YY"))[2]]))
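
For comparison, a more direct way to write steps 1 to 4 might look like the sketch below. It is only a sketch under assumptions: the toy data and the vectors divs and chars are made up for illustration, the new columns are named with "." rather than "YY", and ties go to member 2, which is what BinaryExtract does as well.

```r
# Toy data, made up for illustration only
DF <- data.frame(
  div1.1  = c(10, 20), div1.2  = c(11, 21),
  div2.1  = c(30, 40), div2.2  = c(31, 41),
  char1.1 = c(5, 1),   char1.2 = c(2, 4),
  char2.1 = c(0, 9),   char2.2 = c(7, 8)
)

divs  <- c("div1", "div2")    # stands in for c("div1", ..., "divN")
chars <- c("char1", "char2")  # stands in for c("char1", ..., "charM")

for (d in divs) {
  for (ch in chars) {
    # TRUE where member 1 has the strictly larger value of this char
    first_larger <- DF[[paste0(ch, ".1")]] > DF[[paste0(ch, ".2")]]
    # Store a plain vector column; ties fall to member 2 (an assumption)
    DF[[paste(d, ch, sep = ".")]] <- ifelse(first_larger,
                                            DF[[paste0(d, ".1")]],
                                            DF[[paste0(d, ".2")]])
  }
}

names(DF)
```

Because every new column is an ordinary vector assigned with `[[<-`, the columns should show up under their own names when DF is printed.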
    

My Problem

A. This looks like a very complicated solution for something so simple. What am I missing?

B. Moreover, when I print DF by just typing DF at the console, I do not see the variables named in NameFull. They seem to appear under the names of char instead. Here's what I get: http://shorttext.com/5d9102c

Similarly, I have tried to rename them all to get rid of the "YY", but it does not seem to work:

    names(DF[, as.vector(NameFull)]) <- as.vector(sapply(c("div1",...,"divN"), function(nam)
        paste(nam, names(DF$CharMax.1), sep=".")))

When I look at names(DF), I still get the old names with the "YY".
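
One way the renaming could be done in place is sketched below, assuming toy column names that are made up for illustration. The idea is to assign into names(DF) itself, at the positions of the old names, rather than into names() of a subset of DF.

```r
# Toy data, made up for illustration only
DF <- data.frame(div1YYchar1 = 1:2, div1YYchar2 = 3:4, other = 5:6)

old <- c("div1YYchar1", "div1YYchar2")
new <- sub("YY", ".", old, fixed = TRUE)  # replace the "YY" separator with "."

# Assign into names(DF) itself, at the positions where the old names sit
names(DF)[match(old, names(DF))] <- new

names(DF)
```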

However, I do get a result if I call them explicitly:

 > DF[,"divIYYcharJ"]

I would really appreciate any suggestions, comments, and explanations. I am quite new to R and am more used to Stata. I feel there is something deeply inefficient here. Thanks.

Doon_Bogan
  • [reproducible example reproducible example reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Rich Scriven Oct 10 '14 at 17:30
  • Just transform your data.frame to long format. The problem becomes very simple then. You should read [Tidy Data](http://vita.had.co.nz/papers/tidy-data.pdf). – Roland Oct 10 '14 at 17:31
  • `apply(df, 1, function(x) sapply(seq(1, length(x), 2), function(i) ifelse(x[i] > x[i+1], x[i], x[i+1])))` Try that: a row-wise operation on the columns within each row. It assumes the two columns of each pair are right next to each other. Also make sure that your df contains only the columns for the characteristics. – Vlo Oct 10 '14 at 18:52
  • Thank you for your answers. @Roland I thought about reshaping, but since I have many many observations (21800 couples), I am afraid that the reshaping will take forever and that after that, **all** operations are going to take very long – Doon_Bogan Oct 15 '14 at 08:39
  • @Doon_Bogan If you use package data.table, reshaping should be almost instantaneous. – Roland Oct 15 '14 at 08:52
  • @RichardScriven: I added the reproducible example, if you also have any other suggestions. Thanks for the comment – Doon_Bogan Oct 15 '14 at 09:58
