R: summarising a dataframe while retaining/adding together values within the dataframe

Question

R: randomly sampling a dataframe based on another dataframe with a range - background question/details

I'm still struggling with this random sampling problem. It's still not doing quite what I want, and on top of that the data isn't quite what I previously thought it was, so the old version no longer seems to work.

It's a similar situation: I have 2 sets of dataframes. One (dfA) holds climate data and has over 100,000 rows; the other set (dfB and dfC) holds the sampling criteria (seed zone, altitude, and freq), with only about 50 rows each.

What I'm aiming to do is use dfB/dfC as a basis to randomly sample rows from dfA based on shared columns (seed zone and altitude), picking n rows based on the freq column.

The difference from the previous question is that we realised the altitudes are not continuous values, they actually fall into 2 categories: 301m and above, and 300m and below. For the above values I've set their altitude to 2000m, and for the below values I've set their altitude to 300m, and adjusted the old code accordingly to try and reproduce the range effect. This all works fine...for dfC.

dfB was annoying format-wise, having multiple rows of the same BMID with different frequencies, and looks like this:

BMID	Frequency	Seed_zone	Altitude
bnaRP105hSI	2	105	2000
bnaRP105hSI	2	105	2000
bnaRP109hSI	1	109	2000
bpeRP102SI	2	102	300
bpeRP102SI	1	102	300

Whereas dfC was much simpler to work with and I was easily able to summarise it using the code below:

dfC_count <- dfC %>% group_by(BMID, Code, Species_bin) %>% 
  summarise(Frequency=n(),
            .groups = 'drop')

This means it has no repeat rows, and the frequency is a 'total cumulative frequency':

BMID	Frequency	Seed_zone	Altitude
acaRP104SI	2	104	300
bnaRP105hSI	2	105	2000
bnaRP109hSI	1	109	2000
bpeRP102SI	2	102	300
cavRP402SI	1	402	300

My question is, how can I summarise dfB into a format like dfC whilst maintaining the correct total frequency so it looks more like:

BMID	Frequency	Seed_zone	Altitude
bnaRP105hSI	4	105	2000
bnaRP109hSI	1	109	2000
bpeRP102SI	3	102	300

Using the same code I used for dfC doesn't add the frequencies together; it just tells me how many rows have a unique combination of BMID and frequency. The raw data for dfC had 1 frequency per row, so it was really easy to summarise, whereas the raw data for dfB had various amounts per row (along with other info I didn't need, like year).
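To illustrate the difference between counting rows and adding up values, here is a minimal sketch built from the sample dfB rows above. The column names `n_rows` and `total_freq` are made up for the comparison and are not part of the original data.

```r
library(dplyr)

# Sample rows copied from the dfB table above
dfB <- data.frame(
  BMID      = c("bnaRP105hSI", "bnaRP105hSI", "bnaRP109hSI",
                "bpeRP102SI", "bpeRP102SI"),
  Frequency = c(2, 2, 1, 2, 1),
  Seed_zone = c(105, 105, 109, 102, 102),
  Altitude  = c(2000, 2000, 2000, 300, 300)
)

# n() counts the rows in each group; sum() adds up the values those rows hold
dfB_compare <- dfB %>%
  group_by(BMID) %>%
  summarise(n_rows     = n(),             # illustrative name
            total_freq = sum(Frequency),  # illustrative name
            .groups = 'drop')

print(dfB_compare)
```

With these rows, `n_rows` only reports how many times each BMID appears, while `total_freq` is the cumulative frequency I'm actually after.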

I've tried the answers in this question (https://stackoverflow.com/questions/49361640/adding-together-rows-from-duplicate-entries-in-a-dataframe-excel-or-r), but they don't work because my dataframe has some other columns that are characters. I want to retain the character columns if possible, or at the very least the BMID column.

EDIT: I accidentally managed to fix this but I can't mark my own replies as answers. For posterity the following code works!

# Naming the result keeps the column called Frequency rather than `sum(Frequency)`
dfB <- dfB %>% group_by(BMID, Seed_zone, Altitude) %>% 
  summarise(Frequency = sum(Frequency),
            .groups = 'drop')

score 0 · Answer 1 · answered Jul 05 '23 at 17:51


I accidentally managed to fix this so posting this for posterity!

# Naming the result keeps the column called Frequency rather than `sum(Frequency)`
dfB <- dfB %>% group_by(BMID, Seed_zone, Altitude) %>% 
  summarise(Frequency = sum(Frequency),
            .groups = 'drop')
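For anyone wanting to try this on the sample rows from the question, here is a minimal sketch. It assumes a hypothetical extra character column called `Label` (standing in for the other character columns mentioned in the question) whose value is constant within each BMID; `dfB_sum` and `dfB_count` are illustrative names too.

```r
library(dplyr)

# Sample rows from the question, plus a hypothetical character column
dfB <- data.frame(
  BMID      = c("bnaRP105hSI", "bnaRP105hSI", "bnaRP109hSI",
                "bpeRP102SI", "bpeRP102SI"),
  Frequency = c(2, 2, 1, 2, 1),
  Seed_zone = c(105, 105, 109, 102, 102),
  Altitude  = c(2000, 2000, 2000, 300, 300),
  Label     = c("a", "a", "b", "c", "c")  # hypothetical, not in the original data
)

# Sum the frequencies per group and keep the character column
# by taking its first value within each group
dfB_sum <- dfB %>%
  group_by(BMID, Seed_zone, Altitude) %>%
  summarise(Frequency = sum(Frequency),
            Label     = first(Label),
            .groups = 'drop')

# The same totals via a weighted count (this drops Label)
dfB_count <- dfB %>%
  count(BMID, Seed_zone, Altitude, wt = Frequency, name = "Frequency")

print(dfB_sum)
```

If a character column really is constant within each BMID, adding it to `group_by()` should retain it as well; `first()` is just one way to pick a representative value when it isn't part of the grouping.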


user22141110
