0

My question stems out of this and this question asked previously and sufficiently different from them I think. Imagine that I have a minimal dataset (bird) where every row represents an observation of birds at a given time and place as follows:

id,obs,country
A,4,USA
B,3,CAN
A,5,USA
C,4,MEX
C,1,USA
A,3,CAN
D,1,null

What I ideally want is a conversion of this dataset into a form like this removing the nulls from the dataset:

id,tot_obs,country_tot
A,12,2
B,3,1
C,5,2

I know that I can get a count of factors using:

table(bird$country)

but, is there a smarter, perhaps, one line way of removing the nulls, adding up the total counts, finding the counts of the countries and then reconfiguring them into this form? If there is a package which does this, then I am open to that suggestion as well. Thanks !

Community
  • 1
  • 1
Shion
  • 395
  • 1
  • 3
  • 13

1 Answers1

2

Load data with stringsAsFactors=FALSE:

df <- read.csv(header=TRUE, text="id,obs,country
A,4,USA
B,3,CAN
A,5,USA
C,4,MEX
C,1,USA
A,3,CAN
D,1,null", stringsAsFactors=FALSE)

# check to see if columns are factors
sapply(df, class)
#          id         obs     country 
# "character"   "integer" "character" 

Remove all rows with country = null

df <- df[df$country != "null", ]

Then you can use plyr package with summarise to get the desired result as follows:

ddply(df, .(id), summarise, tot_obs=sum(obs), tot_country=length(unique(country)))
#   id tot_obs tot_country
# 1  A      12           2
# 2  B       3           1
# 3  C       5           2
Arun
  • 116,683
  • 26
  • 284
  • 387
  • I tried out the exact code as above. It gave me an error: `Error in Summary.factor(c(45L, 3L, 165L, 3L, 165L, 3L, 70L, 165L, 165L, : sum not meaningful for factors` – Shion Mar 21 '13 at 01:41
  • yes. make sure your columns are not factors. load your csv file with `stringsAsFactors=FALSE` if need be. – Arun Mar 21 '13 at 06:48
  • Perfect ! Thank you so much for your help. It all works well now. – Shion Mar 21 '13 at 17:17