I am using R to prepare some data for a D3 visualization. The visualization was built around the following structure (this is a single row from a .csv file that is subsequently converted to JSON in JavaScript).

Joe.Schmoe, joe.schmoe@email.com, Sao Paulo, ["Community01", "Community02", "Community03"], ["workgroup01","workgroup02"]

This is a single row. The headers would be:

Person, Email, Location, Communities, Workgroups

You'll notice that the Communities and Workgroups columns contain lists. Furthermore, these lists will vary in length depending on which Communities and Workgroups each individual is associated with. I recognize that this is probably not best practice with regard to data "tidiness," but it is what this viz is expecting.

So ... in R (which I'm learning), I'm finding it impossible to recreate this structure because, when I try to populate the "communities" or "workgroups" variables, R seems to expect that each variable will be of equal length.

The code that I have reads from a data.frame which is a list of the members of a particular community, and adds the name of that community to a column in a master data.frame of all employees. I'm indexing by email address because it is unique. So this particular loop looks at each individual email address in a data.frame called "commTD" and finds it in a master data.frame called "testr". If it finds it, it looks at the communities variable and either replaces an NA value with the name of the community (in this case "Technical Design"), or if the vector already exists, appends "Technical Design" to it:

for(i in commTD$email){
    if(i %in% testr$email){
        tmpList <- testr[which(testr$email ==i) , 'communities']

        if(is.na(tmpList)){
            tmpList <- list(c("Technical Design"))
        }

        else{        
            tmpList <- append(tmpList[[1]][1], 'Technical Design')
        }

    testr[which(testr$email ==i) , 'communities'] <- list(tmpList)
    }   
} 

This works fine for the initial replacement, but if I append a new community to the list, and then try to pass it back into the testr data.frame, I get an error:

Error in `[<-.data.frame`(`*tmp*`, which(testr$email == i), "communities", 
: replacement has 2 rows, data has 1

You'll note that I'm trying to create a list of vectors, which is just one way I've tried to figure this out. I thought maybe I could force R to see the list as a single object, even though it contains multiple items -- or in this case a vector of multiple items.

Is this just impossible in R, to have varied length vectors or lists as a single variable in a data frame?

Stan
joshemig
  • The `data.table` package supports `list`-class columns. I think it's not supported in base R. Also, it sounds like what you're doing in a loop would be better accomplished by a merge. – Frank Jun 01 '15 at 19:08
  • Try, for example `DT <- data.table(a=1:2,b=list(c(4,5),c(4,5,6))); DT` Here's the reference: http://stackoverflow.com/a/22536321/1191259 – Frank Jun 01 '15 at 19:16
  • Thanks, @Frank. The `data.table` with `list`-class columns worked after some fumbling around. – joshemig Jun 01 '15 at 22:06

1 Answer

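Data frames are by definition a list of columns of equal length, one element per row, so an ordinary (atomic) column cannot hold vectors of varying length. A column can itself be a list, but, as your error shows, assigning into one with testr[row, 'communities'] <- ... is awkward, so a data.frame is not a natural fit for this data.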

You could either use another type of object, such as data.table, as suggested, or think of your desired output as a list of unequal-length vectors to pass to your JavaScript.
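One caveat worth noting, as a sketch rather than a recommendation: base R will also accept a list as a data frame column, provided the list has one element per row. The error in the question comes from assigning through testr[row, 'communities']; indexing the column itself with [[ ]] sidesteps it. The data below is a made-up stand-in for the asker's testr.

```r
# Hypothetical stand-in for the asker's master data frame
testr <- data.frame(
  email = c("joe.schmoe@email.com", "joe.bloe@email.com"),
  stringsAsFactors = FALSE
)

# A list column: one list element per row, each of any length
testr$communities <- list(c("Community01", "Community02"), NA_character_)

# Append to one person's cell by indexing the column with [[ ]]
row <- which(testr$email == "joe.schmoe@email.com")
testr$communities[[row]] <- c(testr$communities[[row]], "Technical Design")

# Replace an NA placeholder instead of appending to it
row <- which(testr$email == "joe.bloe@email.com")
testr$communities[[row]] <- "Technical Design"

# Number of communities held in each row's cell
lengths(testr$communities)
```

Whether this is more convenient than data.table or a plain list depends on how the result is exported afterwards.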

That object would look something like:

dataList <- list(name = c("Joe.Schmoe", "Joe.Bloe"),
                 email = c("joe.schmoe@email.com", "joe.bloe@email.com"),
                 location = c("Sao Paulo", "London"),
                 Communities = list(c("Community01", "Community02", "Community03"), 
                                  c("Community02", "Community05", "Community03")
                 ),
                 Workgroups = list(c("workgroup01","workgroup02"), 
                                   c("workgroup01","workgroup03"))
                )

Then access each field as you would a data frame column, for output to your JavaScript:

dataList$name
dataList$Communities
etc...
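With this column-oriented layout, the fields for one person share a position across the vectors. A minimal sketch of looking someone up by email, using the same example values as above (match() returns the position of the first hit):

```r
# Same example data as above, repeated so the snippet runs on its own
dataList <- list(
  name = c("Joe.Schmoe", "Joe.Bloe"),
  email = c("joe.schmoe@email.com", "joe.bloe@email.com"),
  location = c("Sao Paulo", "London"),
  Communities = list(c("Community01", "Community02", "Community03"),
                     c("Community02", "Community05", "Community03")),
  Workgroups = list(c("workgroup01", "workgroup02"),
                    c("workgroup01", "workgroup03"))
)

# Position of the person in every field
i <- match("joe.schmoe@email.com", dataList$email)

# Atomic fields use [ ], list fields use [[ ]] to get the inner vector
dataList$location[i]
dataList$Communities[[i]]
```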

As per Frank's suggestion, if you want to look up each entry by its email address, like this:

data_list[["joe.schmoe@email.com"]]

...then build the list with the email addresses as the element names, like so:

data_list = list(`joe.schmoe@email.com`=list(name="Joe",
                                             location="Sao Paulo",
                                             Communities=....),
                 `joe.bloe@email.com`=list(name="Joe", ...)) 

Then you can avoid explicit for() loops, which are less idiomatic in R, and start the fun of the lapply() family of functions to work on all the entries at once. (See ?lapply for details.)
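As a rough sketch of what that could look like for the task in the question: the entries, the commTD_email vector, and the empty starting Communities are all made up for illustration, and this is only one way to arrange it.

```r
# Hypothetical email-keyed list; values are illustrative only
data_list <- list(
  `joe.schmoe@email.com` = list(name = "Joe.Schmoe", location = "Sao Paulo",
                                Communities = c("Community01", "Community02")),
  `joe.bloe@email.com`   = list(name = "Joe.Bloe", location = "London",
                                Communities = character(0))
)

# Hypothetical membership list for one community
commTD_email <- c("joe.bloe@email.com", "someone.else@email.com")

# Only touch people who exist in the master list
members <- intersect(names(data_list), commTD_email)

# Append the community name to each member's Communities vector
data_list[members] <- lapply(data_list[members], function(person) {
  person$Communities <- c(person$Communities, "Technical Design")
  person
})

# Count communities per person
vapply(data_list, function(person) length(person$Communities), integer(1))
```

Starting each person with character(0) instead of NA means appending needs no special case for the first community.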

Hope it helps.

MarkeD
  • Yeah, though I think it would be better as a list of entries, one for each person... a lot easier to construct. For example, `data_list = list(\`joe.schmoe@email.com\`=list(name="Joe",...),\`joe.bloe@email.com\`=list(name="Joe"))` The OP says the data is "indexed" by email, so this arrangement seems sensible and allows for access like `data_list[["joe.schmoe@email.com"]]` – Frank Jun 01 '15 at 20:01