R: randomly sampling a dataframe based on another dataframe with a range - background question/details

I'm still struggling with this annoying random sampling thing, it's still not doing quite what I want it to, and on top of that the data isn't quite what I previously thought it was so now the old version doesn't seem to work.

It's a similar situation: I have 2 sets of dataframes, one (dfA) with climate data that has over 100,000 rows, and one (dfB and dfC) as a list of sampling criteria (seed zone, altitude, and freq) with only about 50 rows each.

What I'm aiming to do is use dfB/dfC as a basis to randomly sample rows from dfA based on shared columns (seed zone and altitude), picking n rows based on the freq column.


The difference from the previous question is that we realised the altitudes are not continuous values, they actually fall into 2 categories: 301m and above, and 300m and below. For the above values I've set their altitude to 2000m, and for the below values I've set their altitude to 300m, and adjusted the old code accordingly to try and reproduce the range effect. This all works fine...for dfC.

dfB was annoying format-wise, having multiple rows of the same BMID with different frequencies, and looks like this:

BMID Frequency Seed_zone Altitude
bnaRP105hSI 2 105 2000
bnaRP105hSI 2 105 2000
bnaRP109hSI 1 109 2000
bpeRP102SI 2 102 300
bpeRP102SI 1 102 300

Whereas dfC was much simpler to work with and I was easily able to summarise it using the code below:

dfC_count <- dfC %>% group_by(BMID, Code, Species_bin) %>% 
  summarise(Frequency=n(),
            .groups = 'drop')

This means it has no repeat rows, and the frequency is a 'total cumulative frequency':

BMID Frequency Seed_zone Altitude
acaRP104SI 2 104 300
bnaRP105hSI 2 105 2000
bnaRP109hSI 1 109 2000
bpeRP102SI 2 102 300
cavRP402SI 1 402 300

My question is, how can I summarise dfB into a format like dfC whilst maintaining the correct total frequency so it looks more like:

BMID Frequency Seed_zone Altitude
bnaRP105hSI 4 105 2000
bnaRP109hSI 1 109 2000
bpeRP102SI 3 102 300

Using the same code I used for dfC doesn't add the frequencies together; n() only counts the rows in each group, so it just tells me how many rows have a unique combination of BMID and frequency. The raw data for dfC had 1 frequency per row so was really easy to summarise, whereas the raw data for dfB had various amounts per row (along with other info I didn't need, like year).

I've tried the answers in this question, but they don't work because my dataframe has some other columns that are characters: https://stackoverflow.com/questions/49361640/adding-together-rows-from-duplicate-entries-in-a-dataframe-excel-or-r. I want to retain the character columns if possible, or at the very least the BMID column.

EDIT: I accidentally managed to fix this but I can't mark my own replies as answers. For posterity the following code works!

# Name the summed column so it stays "Frequency" rather than "sum(Frequency)"
dfB <- dfB %>%
  group_by(BMID, Seed_zone, Altitude) %>%
  summarise(Frequency = sum(Frequency), .groups = 'drop')

1 Answer

I accidentally managed to fix this so posting this for posterity!

# Name the summed column so it stays "Frequency" rather than "sum(Frequency)"
dfB <- dfB %>%
  group_by(BMID, Seed_zone, Altitude) %>%
  summarise(Frequency = sum(Frequency), .groups = 'drop')
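
For anyone wanting to check this against the sample rows from the question, here is a minimal sketch. The Species column and its values are hypothetical, added only to stand in for the extra character columns mentioned in the question; first() assumes that column holds the same value on every row of a given BMID, which may not be true for the real data.

```r
library(dplyr)

# Sample rows copied from the question's dfB.
# Species is a hypothetical character column, not part of the original data.
dfB <- data.frame(
  BMID      = c("bnaRP105hSI", "bnaRP105hSI", "bnaRP109hSI", "bpeRP102SI", "bpeRP102SI"),
  Frequency = c(2, 2, 1, 2, 1),
  Seed_zone = c(105, 105, 109, 102, 102),
  Altitude  = c(2000, 2000, 2000, 300, 300),
  Species   = c("sp_a", "sp_a", "sp_a", "sp_b", "sp_b")
)

dfB_count <- dfB %>%
  group_by(BMID, Seed_zone, Altitude) %>%
  summarise(
    Frequency = sum(Frequency),  # add the per-row frequencies together
    Species   = first(Species),  # keep one value of the character column per group
    .groups   = "drop"
  )

print(dfB_count)
```

With these rows the result has one row per BMID, with frequencies 4, 1 and 3, matching the target table in the question. If a character column can differ within a BMID, it would need to go into group_by() instead, which would split that BMID into several rows.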