2

I am new to R and don't want to misunderstand the language and its data structures from the start. :)

My data.frame sample.data contains, besides 'normal' attributes (e.g. author), a nested list of data.frames (files), each of which has attributes such as extension.

How can I filter for authors who have created files with a certain extension? Is there an R-ic way of doing that? Maybe something along these lines:

t <- subset(data, data$files[['extension']] > '.R')

Actually I want to avoid for loops.

Here you can find some sample data:

d1 <- data.frame(extension=c('.py', '.py', '.c++')) # and some other attributes
d2 <- data.frame(extension=c('.R', '.py')) # and some other attributes

sample.data <- data.frame(author=c('author_1', 'author_2'), files=I(list(d1, d2)))

The JSON that sample.data comes from looks like this:

[
    {
        "author": "author_1",
        "files": [
            {
                "extension": ".py",
                "path": "/a/path/somewhere/"
            },
            {
                "extension": ".c++",
                "path": "/a/path/somewhere/else/"
            }, ...
        ]
    }, ...
]
Mike Wise
Michael Dorner
  • You got me wrong. I do use `jsonlite` and I already get a `data.frame`. But now the question is how to analyze the data easily and efficiently. – Michael Dorner Jul 14 '15 at 08:51
  • Would your data be better described as a list? Data frames are intended for tabular data, but if the structure is more complicated (as in some JSON files) then a list can be more appropriate. – jkeirstead Jul 14 '15 at 08:51
  • I added the JSON the data comes from. For every dictionary, `files` becomes a `data.frame` within the containing `data.frame`. And yes, it is not trivial. :-D – Michael Dorner Jul 14 '15 at 08:56
  • You don't mean 'nested dataframe', you mean *'dataframe which references other dataframes, e.g. in a SQL-like schema'*. The usual way is to use join operations, e.g. with dplyr package. Store `authors` and `files` in separate tables/dataframes. – smci Jul 14 '15 at 08:57
  • Creating sample data greatly helps in finding a solution – Pierre L Jul 14 '15 at 09:16
  • Hey, what happened to the json? Please put it back :) – Mike Wise Jul 14 '15 at 09:51

4 Answers

6

There are at least a dozen ways of doing this, but if you want to learn R right, you should learn the standard ways of subsetting data structures, especially atomic vectors, lists and data frames. This is covered in chapter two of this book:

http://adv-r.had.co.nz/

There are other great books, but this is a good one, and it is online and free.

UPDATE: Okay, this converts your json to a list of data frames.

library("rjson")
s <- paste(c(
'[{' ,
'  "author": "author_1",',
'  "files": [',
'    {',
'     "extension": ".py",',
'     "path": "/a/path/somewhere/"',
'   },',
'   {',
'     "extension": ".c++",',
'     "path": "/a/path/somewhere/else/"',
'    }]',
'},',
'{',
'"author": "author_2",',
'"files": [',
'  {',
'    "extension": ".py",',
'    "path": "/b/path/somewhere/"',
'  },',
'  {',
'    "extension": ".c++",',
'    "path": "/b/path/somewhere/else/"',
'  }]',
'}]'),collapse="")

j <- fromJSON(s)

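todf <- function (x) {
    # Look fields up by name rather than by position, so key order in the JSON does not matter
    vext <- sapply(x$files, function (y) y[["extension"]])
    vpath <- sapply(x$files, function (y) y[["path"]])
    # One row per file; the single author value is recycled across the rows
    data.frame(author=x$author, ext=vext, path=vpath)
}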
listdf <- lapply(j,todf)
listdf

Which yields:

[[1]]
    author  ext                    path
1 author_1  .py      /a/path/somewhere/
2 author_1 .c++ /a/path/somewhere/else/

[[2]]
    author  ext                    path
1 author_2  .py      /b/path/somewhere/
2 author_2 .c++ /b/path/somewhere/else/

And to finish the task, merge and subset:

   mdf <- do.call("rbind", listdf)
   mdf[ mdf$ext==".py", ]

yielding:

    author ext               path
1 author_1 .py /a/path/somewhere/
3 author_2 .py /b/path/somewhere/
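
Since the question asks for the authors rather than the file rows, here is a small sketch of that last step. The data frame `mdf` is rebuilt by hand from the output above (an assumption made only so the snippet runs without rjson), and `py_authors` is just an illustrative name.

```r
# Stand-in for the merged data frame `mdf` from above, values copied from its output.
mdf <- data.frame(
  author = c("author_1", "author_1", "author_2", "author_2"),
  ext    = c(".py", ".c++", ".py", ".c++"),
  path   = c("/a/path/somewhere/", "/a/path/somewhere/else/",
             "/b/path/somewhere/", "/b/path/somewhere/else/"),
  stringsAsFactors = FALSE
)

# Authors with at least one ".py" file, each listed once.
py_authors <- unique(mdf$author[mdf$ext == ".py"])
py_authors
```

The logical index picks the matching rows and `unique()` collapses authors who have several matching files.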
Mike Wise
  • You can create a list of data.frames, which is probably what you want. A data.frame is actually a list of (usually) atomic vectors, so this is very close to being a data frame of data frames. – Mike Wise Jul 14 '15 at 08:57
  • 1
    Hmm, not straightforward. Working on it. – Mike Wise Jul 14 '15 at 09:35
  • This doesn't actually answer the question, which is how to subset a data frame containing a column of data frames (ie, a hierarchical structure). – Hong Ooi Jul 14 '15 at 10:20
  • It answers the question that he posted and then changed. I am waiting for his response before I delete all this work :). – Mike Wise Jul 14 '15 at 10:22
  • See the first comment. He posted a json file to go along with it. – Mike Wise Jul 14 '15 at 10:23
3

Assuming your data frame df, as a CSV, looks like:

author,path,extension
john,/home/john,txt
mary,/home/mary,png

then the easiest solution is to use the dplyr package:

library(dplyr)
filter(df, author=="john" & extension=="txt") 
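
If the data arrive nested, as in the question's sample.data, they first have to be flattened into a table like the CSV above. The following is only a sketch in base R under that assumption; `flat` is an illustrative name, and the result has one row per (author, file) pair.

```r
# Sample data from the question: `files` is a list column of data frames.
d1 <- data.frame(extension = c(".py", ".py", ".c++"))
d2 <- data.frame(extension = c(".R", ".py"))
sample.data <- data.frame(author = c("author_1", "author_2"),
                          files = I(list(d1, d2)))

# For each row, attach the author to that row's files data frame,
# then stack the pieces into one flat table.
flat <- do.call(rbind, lapply(seq_len(nrow(sample.data)), function(i) {
  data.frame(author = as.character(sample.data$author[i]),
             sample.data$files[[i]],
             stringsAsFactors = FALSE)
}))

# Ordinary logical indexing now works on the flat table.
unique(flat$author[flat$extension == ".R"])
```

Once the table is flat, `filter(flat, extension == ".R")` from dplyr applies in the same way as in the example above.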
jkeirstead
  • 1
    True, the easiest is `subset`. But learning the dplyr verbs will help in the long run. `subset` can cause problems http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset – jkeirstead Jul 14 '15 at 08:44
3

Interesting, not many people use R to simulate a hierarchical database!

subset(sample.data, sapply(files, function(df) any(df$extension == ".R")))
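
To make the mechanics visible, here is the same idea spelled out on the question's sample data. This is a sketch, not a different method: it swaps `sapply` for `vapply` (which checks that each result is a single logical) and `subset` for plain logical indexing. The names `has_r` and `r_authors` are illustrative.

```r
# Sample data from the question: `files` is a list column of data frames.
d1 <- data.frame(extension = c(".py", ".py", ".c++"))
d2 <- data.frame(extension = c(".R", ".py"))
sample.data <- data.frame(author = c("author_1", "author_2"),
                          files = I(list(d1, d2)))

# One logical per row: does this author's files data frame contain ".R"?
has_r <- vapply(sample.data$files,
                function(df) any(df$extension == ".R"),
                logical(1))

# Standard logical subsetting; the nested column is kept intact.
r_authors <- sample.data[has_r, ]
as.character(r_authors$author)  # "author_2"
```

No explicit for loop is needed: the apply function visits each nested data frame once and the outer data frame is subset in a single step.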
Hong Ooi
  • I assume your answer implies that R is maybe not the right tool to choose? – Michael Dorner Jul 14 '15 at 10:54
  • 1
    @MichaelDorner Well, it's not designed for the task, but then it's not really designed for relational data either, and people use it for that all the time. You work with the data you have. R's flexibility as a programming language means it's possible to handle hierarchical data without too many problems. – Hong Ooi Jul 14 '15 at 12:22
  • Thanks for this comment, this was very helpful. Not too many, but at least enough problems! I think I will go with some database. :) – Michael Dorner Jul 14 '15 at 12:58
2

I guess the grep() function from base R could be your solution:

files <- data.frame(path = paste0("path", 1:3),
                    extension = c(".R", ".csv", ".R"),
                    creation.date = Sys.Date() + 1:3)

> files
# path extension creation.date
# 1 path1        .R    2015-07-15
# 2 path2      .csv    2015-07-16
# 3 path3        .R    2015-07-17


> files[grep(".R", files$extension, fixed = TRUE),]
# path extension creation.date
# 1 path1        .R    2015-07-15
# 3 path3        .R    2015-07-17
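
One thing to keep in mind: grep() and grepl() treat the pattern as a regular expression by default, so the dot in ".R" matches any character. The sketch below uses a made-up vector `ext` (not from the question) to compare the default behaviour with `fixed = TRUE` and with an exact comparison.

```r
# Hypothetical extensions, chosen to show the difference between the three approaches.
ext <- c(".R", ".csv", "aR", ".Rmd")

# Default: the pattern is a regular expression, so "." matches any character.
regex_match <- grepl(".R", ext)

# fixed = TRUE matches the literal substring ".R".
fixed_match <- grepl(".R", ext, fixed = TRUE)

# Exact comparison keeps only the ".R" extension itself.
exact_match <- ext == ".R"

data.frame(ext, regex_match, fixed_match, exact_match)
```

For filtering on a complete extension, the exact comparison is usually what is wanted; substring matching is useful when the column holds full file names.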
Andriy T.