I would like to take the unique rows of a data frame and then join them with a second data frame of attributes. I'd then like to count the number of varieties, e.g. the number of unique fruits of a particular type or origin.

The first data frame has my list of fruits:

fruits <- read.table(header=TRUE, text="shop    fruit
                    1   apple
                    2   orange
                    3   apple
                    4   pear
                    2   banana
                    1   banana
                    1   orange
                    3   banana")

The second data frame has my attributes:

fruit_class <- read.table(header=TRUE, text="fruit  type    origin
                      apple   pome    asia
                      banana  berry   asia
                      orange  citrus  asia
                      pear    pome    newguinea")

Here's my clumsy solution to the problem:

fruit <- as.data.frame(unique(fruits[,2])) #get a list of unique fruits
colnames(fruit)[1] <- "fruit" #this won't rename the column and I don't know why...
fruit_summary <- join(fruits, fruit_class, by="fruit") #create a data frame that I can query
count(fruit_summary, "origin") #for eg, summarise the number of fruits of each origin

So my main question is: how can this be expressed more elegantly (i.e. a single line rather than 3)? Secondarily: why won't it allow me to rename the column?

Thanks in advance

Joe
setbackademic
  • In base: `aggregate(fruit ~ origin, merge(fruits, fruit_class), FUN = length)` or dplyr: `fruits %>% left_join(fruit_class) %>% count(origin)` – alistaire Oct 25 '16 at 05:36
  • Your base code tells me that there are 12 fruit from asia and 4 from new guinea, so it's summing up the `fruits$shop` column (which I don't want to use). The results should be 3 fruits from asia (apple, banana and orange) and one from new guinea (pear). – setbackademic Oct 25 '16 at 05:46
  • I get 7 and 1, but if you just want to count the origins from `fruit_class`, use `count(fruit_class, origin)`. If you want to make sure they're in `fruits` first, use `fruit_class %>% semi_join(fruits) %>% count(origin)`, which in this case will return the same thing. And neither is summing `shop`; they're counting rows. – alistaire Oct 25 '16 at 05:50
  • Also see the canonical posts for [count aggregation](http://stackoverflow.com/questions/9809166/is-there-an-aggregate-fun-option-to-count-occurrences) and [joining](http://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right). – alistaire Oct 25 '16 at 05:54
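
One way to reconcile the numbers in these comments is to drop duplicate fruit/origin pairs before counting, so each fruit is counted once per origin rather than once per shop. This is a minimal base R sketch, not code from either commenter; the names `fruit_origin` and `origin_counts` are made up for illustration.

```r
fruits <- read.table(header = TRUE, text = "shop fruit
1 apple
2 orange
3 apple
4 pear
2 banana
1 banana
1 orange
3 banana")

fruit_class <- read.table(header = TRUE, text = "fruit type origin
apple pome asia
banana berry asia
orange citrus asia
pear pome newguinea")

# merge() joins on the shared column "fruit"; unique() then keeps
# one row per fruit/origin pair, discarding the per-shop repeats
fruit_origin <- unique(merge(fruits, fruit_class)[, c("fruit", "origin")])

# count the remaining rows (distinct fruits) within each origin
origin_counts <- aggregate(fruit ~ origin, data = fruit_origin, FUN = length)
print(origin_counts)
```

Without the `unique()` step the same `aggregate()` call counts one row per shop/fruit combination, which is where the 7 and 1 come from.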

2 Answers

Simply doing

table(fruit_class$fruit, fruit_class$origin)

gives you

       asia newguinea
apple     1         0
banana    1         0
orange    1         0
pear      0         1

You can add up the counts for each origin with `colSums()`. I can't think of a reason the `fruits` data frame is needed: if a fruit appears there but not in `fruit_class`, there is no origin data for it anyway.

By the way, in your code example, `colnames(fruit)[1] <- "fruit"` should work, but `colnames(fruit) <- "fruit"` is all that's needed, since the data frame has only one column.
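
Put together, the two points above might look like the following sketch. The variable names `tab` and `origin_totals` are only illustrative, and the rename assumes the one-column data frame is built from `fruits` (with an s).

```r
fruits <- read.table(header = TRUE, text = "shop fruit
1 apple
2 orange
3 apple
4 pear
2 banana
1 banana
1 orange
3 banana")

fruit_class <- read.table(header = TRUE, text = "fruit type origin
apple pome asia
banana berry asia
orange citrus asia
pear pome newguinea")

# cross-tabulate fruit against origin, then sum each origin column
tab <- table(fruit_class$fruit, fruit_class$origin)
origin_totals <- colSums(tab)
print(origin_totals)

# one-column data frame of unique fruits, renamed in a single assignment
fruit <- as.data.frame(unique(fruits[, 2]))
colnames(fruit) <- "fruit"
print(colnames(fruit))
```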

Joe

Here is a data.table solution.

library(data.table)
setDT(fruit_class)[, uniqueN(fruit), by=type]
#      type V1
# 1:   pome  2
# 2:  berry  1
# 3: citrus  1

setDT(fruit_class)[, uniqueN(fruit), by=origin]
#       origin V1
# 1:      asia  3
# 2: newguinea  1
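
If the count should only cover fruits that actually appear in `fruits`, a filter can be added in the row position. This is a sketch under that assumption rather than part of the solution above; `n_fruits` and `res` are illustrative names, and `as.data.table()` is used here so the original data frame is left unchanged.

```r
library(data.table)

fruits <- read.table(header = TRUE, text = "shop fruit
1 apple
2 orange
3 apple
4 pear
2 banana
1 banana
1 orange
3 banana")

fruit_class <- read.table(header = TRUE, text = "fruit type origin
apple pome asia
banana berry asia
orange citrus asia
pear pome newguinea")

# keep only fruits sold in some shop, then count distinct fruits per origin
res <- as.data.table(fruit_class)[fruit %in% fruits$fruit,
                                  .(n_fruits = uniqueN(fruit)),
                                  by = origin]
print(res)
```

With this data every fruit in `fruit_class` also appears in `fruits`, so the counts match the unfiltered version.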
jlhoward