I am very stuck on a basic question about summarising categorical data. My raw data consists of multiple records of the form UserID, ItemID, CategoryID, GroupID. For each ItemID there is a fixed CategoryID. For each UserID there is a fixed GroupID. There can be an arbitrary number of entries for each UserID, but only one per ItemID. At the moment, when I read the data in from .csv, I set every column as a factor.

Here is a toy data set:

uIDs <- c("1", "1", "3", "8", "3", "8", "6")
iIDs <- c("a", "c", "d", "d", "e", "f", "g")
cIDs <- c("V", "V", "A", "A", "A", "A", "M")
gIDs <- c("U", "U", "N", "U", "N", "U", "P")
foo <- data.frame(uID = uIDs, iID = iIDs, cID = cIDs, gID = gIDs)

From this data set I need to extract, in usable form, various summaries, such as:

  • for each uID, how many iIDs are there?
  • for each uID, how many cIDs are there?
  • for each iID, how many uIDs are there?
  • for each cID, how many uIDs are there?
  • for each cID, how many gIDs are there?
  • for each gID, how many cIDs are there?

Very straightforward stuff, but I have spent most of the day struggling with it. I am particularly confused by the various ways in which output is returned, in the various functions which can be used to help with this (aggregate, summary, by, table, and friends). Let's take as an example, summary. Its output looks really useful. But I can't figure out how to get at it.

> summary(foo)
 uID    iID   cID   gID  
  8:1   a:1   A:4   N:2  
 1 :2   c:1   M:1   P:1  
 3 :2   d:2   V:2   U:4  
 6 :1   e:1              
 8 :1   f:1              
        g:1

When I inspect the result with str, its structure is very complex and I don't know how to strip it down to get at what I want.

> str(summary(foo))
 'table' chr [1:6, 1:4] " 8:1  " "1 :2  " "3 :2  " "6 :1  " ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:6] "" "" "" "" ...
  ..$ : chr [1:4] "uID" "iID" "cID" "gID"

Given my needs, which are simple, what is the most straightforward way of asking my question so that I can get a result I can easily manipulate further?

thanks!

p.s. Sorry if the code pasting isn't in the right format. I am trying to paste in from RStudio but it doesn't look right. Advice welcome (I tried to search for advice and didn't find anything, but I know it's there somewhere as I read it about 6 months ago...)

Matthew Lundberg
Heather Stark
  • NB: note the error " 8" in the toy data example. – Stephen Henderson Nov 27 '13 at 17:11
  • fixed - thanks - also n.b. a wording clarification - when I said "There can be an arbitrary number of entries for each UserId, but only one per ItemID", what I meant was that for a given UserID and a given ItemID there can be at most one corresponding row entry. – Heather Stark Nov 27 '13 at 18:22
  • thanks for editing my code example Jilber - can you point me at a place that explains the best way to paste in examples? (I am working in RStudio). ta! – Heather Stark Nov 30 '13 at 12:53
  • If you use the `dput(someData)` command, it will output a text representation that you can cut and paste into a console (or here on SO) to recreate the data. If you have a huge dataset, you could use `dput(head(someData))` instead. – Stephen Henderson Nov 30 '13 at 13:05
  • do I type that into the console, then cut and paste from the console window? – Heather Stark Nov 30 '13 at 13:15
  • Indeed, try it. Create a list or data.frame and try `dput`-ing it, then create a copy in your own console with `newDataCopy <-` followed by the pasted output. For toy cases and websites this is easier and more secure than sharing binary RData files. – Stephen Henderson Nov 30 '13 at 13:25
  • ps there is a wider discussion on this topic here: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Stephen Henderson Nov 30 '13 at 13:26
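
To make the `dput` advice in the comments above concrete, here is a minimal sketch of the round trip. The data frame `small` and the names `txt` and `rebuilt` are made up for illustration; in practice you would simply copy the printed text from the console into your post.

```r
# Hypothetical two-row example standing in for a real data set
small <- data.frame(uID = c("1", "3"), iID = c("a", "d"),
                    stringsAsFactors = FALSE)

# dput() prints R code that rebuilds the object; this is the text to paste
dput(small)

# Simulate the copy-and-paste: capture the printed text, then parse and
# evaluate it, as a reader would by pasting it after `rebuilt <-`
txt <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = txt))

# Check that the rebuilt copy matches the original
all.equal(small, rebuilt)
```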

2 Answers

You can use aggregate. I think this is what you're looking for:

> # for each uID, how many iIDs are there? and 
> # for each uID, how many cIDs are there?
> aggregate(cbind(iIDs, cIDs) ~ uID, length, data=foo)
  uID iIDs cIDs
1   1    2    2
2   3    2    2
3   6    1    1
4   8    1    1 # due to the error in the toy example there are two 8
5   8    1    1 # one for "8" and one for " 8" ;)
> 
> # or individually:
> # aggregate(iIDs ~ uID, length, data=foo) 
> # aggregate(cIDs ~ uID, length, data=foo)
>  
> #-------------------------------------------------------------
> # for each iIDs, how many uIDs are there?
> aggregate(uIDs ~ iID, length, data=foo)
  iID uIDs
1   a    1
2   c    1
3   d    2
4   e    1
5   f    1
6   g    1
> #-------------------------------------------------------------
> 
> # for each cID, how many uIDs are there? and
> # for each cID, how many gIDs are there?
> aggregate(cbind(uIDs, gIDs) ~ cID, length, data=foo)
  cID uIDs gIDs
1   A    4    4
2   M    1    1
3   V    2    2
> 
> #-------------------------------------------------------------
> # for each gID, how many cIDs are there?
> aggregate(cIDs ~ gID, length, data=foo)
  gID cIDs
1   N    2
2   P    1
3   U    4
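
A note on reading the counts above: `length` counts rows, so a value that repeats within a group is counted each time. If "how many cIDs" is meant as "how many different cIDs", one possible variant is sketched below. It is an assumption about the intent, not part of the original answer: `n_distinct` is a hypothetical helper, and the calls use the data frame's own column names (`uID`, `cID`, `gID`) through the non-formula interface of `aggregate`, so they do not rely on the free-standing vectors `uIDs`, `cIDs` and so on still being in the workspace.

```r
# Toy data from the question (with the " 8" typo already fixed)
foo <- data.frame(
  uID = c("1", "1", "3", "8", "3", "8", "6"),
  iID = c("a", "c", "d", "d", "e", "f", "g"),
  cID = c("V", "V", "A", "A", "A", "A", "M"),
  gID = c("U", "U", "N", "U", "N", "U", "P"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct values in a vector
n_distinct <- function(x) length(unique(x))

# For each uID, how many distinct cIDs are there?
c_per_u <- aggregate(foo["cID"], by = foo["uID"], FUN = n_distinct)

# For each cID, how many distinct uIDs and distinct gIDs are there?
ug_per_c <- aggregate(foo[c("uID", "gID")], by = foo["cID"], FUN = n_distinct)

c_per_u
ug_per_c
```

Both results are ordinary data frames, so they can be subset, merged or sorted like any other.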
Jilber Urbina
  • typo in dataset now fixed - which might have been the wrong thing to do! thank you for your answer, which shows me some new ways of using aggregate. – Heather Stark Nov 27 '13 at 18:34
  • gave the answer credit to first past the post Stephen - but your answer is great too - need to study them both, many thanks! – Heather Stark Nov 27 '13 at 18:36

You can answer most of those questions like so:

  • for each uID, how many iIDs are there?

with(foo, rowSums(table(uID, iID)))

1 3 6 8 
2 2 1 2 

NB: I think there is a slight error in your example data: one of your uID values is " 8" rather than "8", which confused me for a bit.
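
The same `table` idea may extend to the other questions. The sketch below is one way to do it, not the only one: comparing the table with zero turns the counts into TRUE/FALSE, so `rowSums` and `colSums` then count distinct values instead of rows. The names `tab`, `rows_per_u`, `cats_per_u` and `users_per_c` are made up for illustration.

```r
# Toy data from the question (with the " 8" typo already fixed)
foo <- data.frame(
  uID = c("1", "1", "3", "8", "3", "8", "6"),
  iID = c("a", "c", "d", "d", "e", "f", "g"),
  cID = c("V", "V", "A", "A", "A", "A", "M"),
  gID = c("U", "U", "N", "U", "N", "U", "P"),
  stringsAsFactors = FALSE
)

# Cross-tabulate uID against cID: each cell is the number of rows for that pair
tab <- with(foo, table(uID, cID))

rows_per_u  <- rowSums(tab)      # rows per uID (one row per item here)
cats_per_u  <- rowSums(tab > 0)  # distinct cIDs per uID
users_per_c <- colSums(tab > 0)  # distinct uIDs per cID

rows_per_u
cats_per_u
users_per_c
```

Each result is a plain named numeric vector, so a single value can be picked out by name, for example `users_per_c[["A"]]`.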

Stephen Henderson
  • ah, useful. I hadn't yet worked out how to use the output of 'table' as an input to further calculations, as I couldn't figure out how to tame it into something simple - and rowSums seems to do the required squishing and loses those extra structural bits. – Heather Stark Nov 27 '13 at 18:31
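
On taming `table` output, as raised in the comment above: another option is to convert the table into an ordinary data frame. This is a minimal sketch using `as.data.frame` on a one-way table; the names `counts_df`, `n` and `busy` are hypothetical choices for illustration.

```r
# Only the column needed for this illustration
foo <- data.frame(
  uID = c("1", "1", "3", "8", "3", "8", "6"),
  stringsAsFactors = FALSE
)

# Naming the argument makes the resulting column come out as "uID"
tab <- table(uID = foo$uID)

# One row per uID, with the count in a column called "n"
counts_df <- as.data.frame(tab, responseName = "n", stringsAsFactors = FALSE)

# The result behaves like any data frame, e.g. keep users with more than one row
busy <- counts_df[counts_df$n > 1, ]

counts_df
busy
```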