I have a data source which can be represented as a list of maybe 30 dataframes (I'm saying dataframes, but if a better structure exists feel free to answer using that - nothing has been written yet). The dataframes have no direct relationship to one another, but each has 3 columns and n rows. Column 1 holds labels; columns 2 and 3 hold values (always numeric).

For tangibility, imagine each dataframe is a list of foods eaten at a party, and each column represents the number of each item Alice and Bob ate.

For example

A = [5 x 3] # (apples, pears, cookies, grapes, watermelon)
            # ---------------------------------------------------
            # item           Alice     Bob
            # apples           3        7
            # pears            1        2
            # cookies         10        4   
            # grapes         238      483
            # watermelon       0        1 
            # ---------------------------------------------------

B = [1 x 3] # (grapes)
C = [3 x 3] # (beef, rice, apples)
...
Z = [4 x 3] # (rice, grapes, watermelon, beef)

I want to represent these matrices as a data structure, such that I can ask

  • General questions about all items - e.g. which of Alice and Bob ate the most items overall, what was their average, standard deviation, etc.
  • Label-specific questions - e.g. which of Alice and Bob ate the most grapes?

Whenever I have this kind of problem I always end up writing really ugly code which has lists of lists, requires the [] operator or the as.matrix()/as.list()/as.data.frame() functions, and generally seems like a really crappy way of doing things.

What would be a good (or the best) approach for this kind of data?

  • How about binding all of them together with `rbind`, with their names in a separate column called, say, 'ID' (= A, B, ... Z)? – Arun Jun 15 '13 at 15:46
  • If multiple data frames have the same item, how should they be combined? E.g. in your example both A and B contain a row for grapes. – Ryan C. Thompson Jun 15 '13 at 17:46

1 Answer

Following on @Arun's comment, you can easily create a single data frame with another column indicating the party in question:

# One data frame per party; header = TRUE takes the column names from the first line
A = read.table(text="item           Alice     Bob
                     apples           3        7
                     pears            1        2
                     cookies         10        4
                     grapes         238      483
                     watermelon       0        1", header = TRUE)

B = read.table(text="item           Alice     Bob
                     grapes          13       26", header = TRUE)

C = read.table(text="item           Alice     Bob
                     beef             1        3
                     rice             1        2
                     apples           1        0", header = TRUE)

Z = read.table(text="item           Alice     Bob
                     rice             2        1
                     grapes          10       15
                     watermelon       1        0
                     beef             0        2", header = TRUE)

# Tag each row with the party it came from, then stack everything into one data frame
A$party = "A";    B$party = "B";    C$party = "C";    Z$party = "Z"
dframe = rbind(A, B, C, Z)

From there, you can get functions of the columns without difficulty:

apply(dframe[,2:3], 2, sum)
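
As a rough sketch of how the two kinds of question from the post could be answered on the stacked data frame: the block below rebuilds the same values with `data.frame` so that it runs on its own, then uses base R only (`colSums`, `colMeans`, `sd`, `aggregate`). The names `totals`, `means`, `sds`, `by_item` and `grape_winner` are made up for illustration, and summing duplicate items across parties is an assumption about how they should be combined.

```r
# Same values as dframe above, rebuilt so this block is self-contained.
dframe <- data.frame(
  item  = c("apples", "pears", "cookies", "grapes", "watermelon",
            "grapes", "beef", "rice", "apples",
            "rice", "grapes", "watermelon", "beef"),
  Alice = c(3, 1, 10, 238, 0, 13, 1, 1, 1, 2, 10, 1, 0),
  Bob   = c(7, 2, 4, 483, 1, 26, 3, 2, 0, 1, 15, 0, 2),
  party = c(rep("A", 5), "B", rep("C", 3), rep("Z", 4))
)

people <- c("Alice", "Bob")

# General questions: total, mean and standard deviation per person over all rows.
totals <- colSums(dframe[, people])
means  <- colMeans(dframe[, people])
sds    <- sapply(dframe[, people], sd)

# Label-specific questions: sum each item across all parties
# (assumption: duplicates such as grapes in A, B and Z should be added together).
by_item <- aggregate(cbind(Alice, Bob) ~ item, data = dframe, FUN = sum)

# Who ate the most grapes?
grapes <- by_item[by_item$item == "grapes", people]
grape_winner <- names(grapes)[which.max(unlist(grapes))]
```

Keeping everything in one long data frame means the same few calls work whether there are 4 parties or 30, with no list-of-lists indexing.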

If you wanted to deal with individual items, and they had duplicates between the parties, you could perform joins on the original data frames. There's an SO thread on doing this in R here.
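
For illustration, a join between two of the original data frames might look like the following. This is only a sketch using base R's `merge`; the suffixes `.A` and `.Z` and the result names `both` and `either` are arbitrary choices, not something from the linked thread.

```r
# Two of the party data frames from above, rebuilt so this block is self-contained.
A <- data.frame(item  = c("apples", "pears", "cookies", "grapes", "watermelon"),
                Alice = c(3, 1, 10, 238, 0),
                Bob   = c(7, 2, 4, 483, 1))
Z <- data.frame(item  = c("rice", "grapes", "watermelon", "beef"),
                Alice = c(2, 10, 1, 0),
                Bob   = c(1, 15, 0, 2))

# Inner join: keep only the items that were eaten at both parties.
both <- merge(A, Z, by = "item", suffixes = c(".A", ".Z"))

# Full outer join: keep every item; counts are NA where an item was absent from a party.
either <- merge(A, Z, by = "item", all = TRUE, suffixes = c(".A", ".Z"))
```

Here `both` has one row each for grapes and watermelon, with columns `Alice.A`, `Bob.A`, `Alice.Z` and `Bob.Z`, so the same item can be compared side by side across parties.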
