r script to reshape and count columns within dataset

Question

Hi I have a dataset of freshwater fish in a range of sites, with repeat monthly visits over many years. Each row has the species found, the total and the status (i.e. positive or negative for a test result).

sample_ID   site    coll_date   species TOT inf_status
382870  site 1  27/10/2007  Species B   1   positive
382872  site 2  27/10/2007  Species D   1   positive
487405  site 3  28/10/2007  Species A   1   positive
487405  site 3  28/10/2007  Species A   1   positive
382899  site 4  03/11/2007  Species A   1   positive
382900  site 5  03/11/2007  Species A   1   positive
382901  site 5  03/11/2007  Species A   1   positive
382902  site 6  03/11/2007  Species A   1   positive
382903  site 7  09/12/2007  Species B   1   positive
382904  site 8  05/02/2008  Species C   9   negative
382905  site 8  05/02/2008  Species A   13  negative
382906  site 9  14/02/2008  Species A   1   positive
382907  site 9  14/02/2008  Species A   1   positive

I need to reformat the data so that there is just one row per site visit (i.e. in a given site name and date combo) with columns for total found by species and the fish status (i.e. speciesA_pos, SpeciesA_neg, Sp_B_pos.. etc).

site    coll_date   SP_A_pos    SP_A_neg    SP_B_pos    SP_B_neg    SP_C_pos      SP_C_neg  SP_D_pos    SP_D_neg
site 1  27/10/2007  0   0   1   0   0   0   0   0
site 2  27/10/2007  0   0   0   0   0   0   1   0
site 3  28/10/2007  3   0   0   0   0   0   0   0
site 4  03/11/2007  1   0   0   0   0   0   0   0
site 5  03/11/2007  2   0   0   0   0   0   0   0
site 6  03/11/2007  1   0   0   0   0   0   0   0
site 7  09/12/2007  0   0   1   0   0   0   0   0
site 8  05/02/2008  0   13  0   0   0   9   0   0
site 9  14/02/2008  2   0   0   0   0   0   0   0

figured I could use the reshape function but still need to sum within site visits as reshape will take the first row. My thoughts were to use split/apply/aggregate/for loops etc but tried various combinations and not getting anywhere. apologies I'm not familiar with R. any comments appreciated!

I guess you need `dcast` i..e `library(reshape2);dcast(df1, site+coll_date~species + inf_status, length)` — akrun, Sep 07 '16 at 12:09
wow thankyou Akrun that is an elegant solution. and apologies I didn't realise it was a duplicate question — ignatius aelurus, Sep 07 '16 at 12:39
@akrun this doesn’t work for the rows where the count is more than 1 e.g. for site 8 – where the total is 9 but the script above returns 1. Any thoughts? — ignatius aelurus, Sep 07 '16 at 14:26

score 1 · Accepted Answer · answered Sep 07 '16 at 17:39

1

With tidyr/dplyr, you can make a new variable that represents the combination of species and status, sum up the total for each site/date/species-status, and then spread the species-status into columns, filling with the sum.

library(tidyr)
library(dplyr)

dat %>%
    unite(sp_status, species, inf_status) %>%
    group_by(site, coll_date, sp_status) %>%
    summarise(TOT = sum(TOT)) %>%
    spread(key = sp_status, value = TOT, fill = 0)

The same thing can be done in reshape2 with dcast, taking advantage of the ability of dcast to aggregate and reshape to wide format simultaneously.

library(reshape2)
dcast(dat, site + coll_date ~ species + inf_status, value.var = "TOT", fun.aggregate = sum)

answered Sep 07 '16 at 17:39

aosmith

34,856
9
84
118

excellent i was almost there with the dcast code but had kept getting error 'sum not avail for factors'. Thanks so much – ignatius aelurus Sep 08 '16 at 10:54
Sounds like `TOT` is a factor (but shouldn't be). This likely happened when you read your data in; you probably have some sort of characters or symbols in that column. This often happens due to missing values, such as using "." or "na" or "N/A" as the missing value code, which can be defined as actually missing upon reading the dataset into R. Best to figure that out. If the problem cells should be NA, you could then do `dat$TOT = as.numeric(as.character(dat$TOT))` to convert. – aosmith Sep 08 '16 at 14:11

r script to reshape and count columns within dataset

1 Answers1