
I want to create a data structure in the form of

Start, End, Elements
  3  , 6  ,  {4,5}
  4 ,  10 ,  {7,8,9}
   ....

In words: I am moving a ball along a line. "Start" is the leftmost position of the ball and "End" is the rightmost. "Elements" are the positions I find special for some reason. What is the best data structure to use when the number of elements can grow very large? The only thing I can think of is a data frame whose third column is an appropriately formatted string, which I would then have to parse whenever I wanted to look at each number in the set. Does R have a better data format, or is that about it?

Thanks!

user1357015

2 Answers


The option mentioned in my comment, i.e. simply using a list for one of the columns:

> dat <- data.frame(Start = 3:4, End = c(6,10))
> dat
  Start End
1     3   6
2     4  10
> dat$Elements <- list(4:5,7:9)
> dat
  Start End Elements
1     3   6     4, 5
2     4  10  7, 8, 9
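
Once the list column is in place, each row's elements come back as an ordinary vector, so there is no string parsing involved. A minimal sketch of a few lookups, assuming the same `dat` as above (the names `first`, `n_elements`, `has_8` and `dat_with_8` are just illustrative, and `lengths()` needs R 3.2.0 or later):

```r
# Rebuild the example data frame with a list column
dat <- data.frame(Start = 3:4, End = c(6, 10))
dat$Elements <- list(4:5, 7:9)

# Elements of the first row, as a plain integer vector
first <- dat$Elements[[1]]

# Number of special positions in each row
n_elements <- lengths(dat$Elements)

# Keep only the rows whose Elements contain position 8
has_8 <- vapply(dat$Elements, function(e) 8 %in% e, logical(1))
dat_with_8 <- dat[has_8, ]
```

Note the double brackets: `dat$Elements[[1]]` gives the vector itself, while `dat$Elements[1]` gives a list of length one.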

You could also of course ditch data frames entirely and simply use a plain old list (which might make more sense in a lot of cases, anyway):

> list(list(Start = 3,End = 6, Elements = 4:5),list(Start = 4,End = 10,Elements = 7:9))
[[1]]
[[1]]$Start
[1] 3

[[1]]$End
[1] 6

[[1]]$Elements
[1] 4 5


[[2]]
[[2]]$Start
[1] 4

[[2]]$End
[1] 10

[[2]]$Elements
[1] 7 8 9
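
With the plain-list version, pulling a field out of every record is a one-liner with `vapply()` or `Filter()`. A rough sketch, assuming the list above is stored in a variable called `runs` (a name made up for this example):

```r
# The same nested list as above, stored under an illustrative name
runs <- list(
  list(Start = 3, End = 6,  Elements = 4:5),
  list(Start = 4, End = 10, Elements = 7:9)
)

# Extract one field from every record
starts <- vapply(runs, function(r) r$Start, numeric(1))

# Count the special positions in each record
counts <- vapply(runs, function(r) length(r$Elements), integer(1))

# Keep only the records whose Elements contain position 8
with_8 <- Filter(function(r) 8 %in% r$Elements, runs)
```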
joran

You could store it as a tall data frame rather than a wide one, and probably use the data.table package to process it efficiently. That is, make one row per element rather than one row per start-end pair:

library(data.table)
dt = data.table(Start=c(3, 3, 4, 4, 4), End=c(6, 6, 10, 10, 10), Elements=c(4, 5, 7, 8, 9))
#   Start End Elements
#1:     3   6        4
#2:     3   6        5
#3:     4  10        7
#4:     4  10        8
#5:     4  10        9

This would let you do multiple kinds of processing on the data quite easily, such as determining how many elements are in each range:

dt[, list(Num.Elements=length(Elements)), by=c("Start", "End")]

#    Start End Num.Elements
# 1:     3   6            2
# 2:     4  10            3

This would also make the data easy to plot with the ggplot2 package, which usually expects data to be in a tall format.

You might note that this data structure is wasteful, since it repeats the Start and End for each element. However, data tables are stored very efficiently: even if your list of elements is literally millions long, it can easily fit in memory and be processed in this manner. Try a line like:

dt = data.table(Start=1:1e6, End=1:1e6, Elements=1:1e6)

for a demonstration. It would certainly be faster to deal with than keeping each element list as a string and splitting it each time.
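
If the data already sits in a data frame with a list column (as in joran's answer), it can be reshaped into this tall layout without any parsing. A sketch in base R, under the assumption that the list column is called `Elements`; `tall` and `counts` are illustrative names, and the result can be passed to `data.table()` if you want to use the syntax above:

```r
# Wide form: one row per start-end pair, Elements held in a list column
dat <- data.frame(Start = 3:4, End = c(6, 10))
dat$Elements <- list(4:5, 7:9)

# Repeat each Start/End once per element, then flatten the list column
n <- lengths(dat$Elements)
tall <- data.frame(
  Start    = rep(dat$Start, times = n),
  End      = rep(dat$End,   times = n),
  Elements = unlist(dat$Elements)
)

# Base R equivalent of the per-range count shown earlier
counts <- aggregate(Elements ~ Start + End, data = tall, FUN = length)
```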

David Robinson
  • I didn't realize that saving it as a list, as joran mentioned above, is an option. Now I just save it as a list of numeric elements, which is easy to do. Thanks for the idea though; I wasn't even thinking of the long format until you mentioned it! – user1357015 Apr 12 '13 at 20:51