0

I have this final dataset of roughly 150 000 rows per 40 columns that covers all my potential samples from 1932 to 2016, and I need to make a random selection of 53 samples per year for a total number of ~5000.

The selection in itself is really straight forward using the sample() function to get a subset, however I need to display the selection in the original dataframe to be able to check various things. My issue is the following:

If I edit one of the fields in my random subset and merge it back with the main one, it creates duplicates that I can't remove because one field changed and thus R considers the two rows aren't duplicates. If I don't edit anything, I can't find which rows were selected.

My solution for now was to merge everything in Excel instead of R, apply color codes to highlight the selected rows and delete manually the duplicates. However it's time consuming, prone to mistakes and not practicable as the dataset seems to be too big and my PC quickly runs out of memory when I try...

UPDATE:

Here's a reproducible example:

dat <- data.frame(
  X = sample(2000:2016, 50, replace=TRUE),
  Y = sample(c("yes", "no"), 50, replace = TRUE),
  Z = sample(c("french","german","english"), 50, replace=TRUE)
)

dat2 <- subset(dat, dat$X==2000)                   #samples of year 2000
sc <- dat2[sample(nrow(dat2), 1), ]                #Random selection of 1

What I would like to do is select directly in the dataset (dat1), for example by randomly assigning the value "1" in a column called "selection". Or, if not possible, how can I merge the sampled rows (here called "sc") back to the main dataset but with something indicating they have been sampled

Note:

I've been using R sporadically for the last 2 years and I'm a fairly inexperienced user, so I apologize if this is a silly question. I've been roaming Google and SO for the last 3 days and couldn't find any relevant answer yet.

I recently got in a PhD program in biology that requires me to handle a lot of data from an archive.

Florian
  • 24,425
  • 4
  • 49
  • 80
C.Den
  • 13
  • 4
  • 1
    Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – zx8754 Jan 19 '18 at 09:58
  • 1
    Sorry, it's really not clear what exactly you're asking. Please provide a reproducible example on a minimal data set – MichaelChirico Jan 19 '18 at 09:58
  • Please provide a reproducible example. To me your latest update dose not clarify your question. – www Jan 19 '18 at 10:20
  • Updated with a reproducible example – C.Den Jan 19 '18 at 10:47

1 Answers1

1

EDIT: updated based on comments.

You could add a column that indicates if a row is part of your sample. So maybe try the following:

df = data.frame(year= c(1,1,1,1,1,1,2,2,2,2,2,2), id=c(1,2,3,4,5,6,7,8,9,10,11,12),age=c(7,7,7,12,12,12,7,7,7,12,12,12))

library(dplyr)
n_per_year_low_age = 2
n_per_year_high_age = 1
df <- df %>% group_by(year) %>% 
  mutate(in_sample1 = as.numeric(id %in% sample(id[age<8],n_per_year_low_age))) %>% 
  mutate(in_sample2 = as.numeric(id %in% sample(id[age>8],n_per_year_high_age))) %>%
  mutate(in_sample = in_sample1+in_sample2) %>%
  select(-in_sample1,-in_sample2)

Output:

# A tibble: 12 x 4
# Groups: year [2]
    year    id   age in_sample
   <dbl> <dbl> <dbl>     <dbl>
 1  1.00  1.00  7.00      1.00
 2  1.00  2.00  7.00      1.00
 3  1.00  3.00  7.00      0   
 4  1.00  4.00 12.0       1.00
 5  1.00  5.00 12.0       0   
 6  1.00  6.00 12.0       0   
 7  2.00  7.00  7.00      1.00
 8  2.00  8.00  7.00      0   
 9  2.00  9.00  7.00      1.00
10  2.00 10.0  12.0       0   
11  2.00 11.0  12.0       0   
12  2.00 12.0  12.0       1.00

Futher operations are then trivial:

# extracting your sample
df %>% filter(in_sample==1)
# comparing statistics of your sample against the rest of the population
df %>% group_by(year,in_sample) %>% summarize(mean(id))
Florian
  • 24,425
  • 4
  • 49
  • 80
  • Thanks for the example. However when trying your code I run into an error for the last two lines, telling me the column "in_sample" doesn't exist. I'm not familiar with dplyr, but it seems it allows for direct operations on the dataframe without actually modifying it? – C.Den Jan 19 '18 at 10:41
  • Hi C.Den! In my answer I did not store the modified object, you have to assign it back to the variable again. So `df <- df %>% group_by(year) %>% mutate( in_sample = as.numeric(id %in% sample(id,n_per_year)))` instead of just `df %>% group_by(year) %>% mutate( in_sample = as.numeric(id %in% sample(id,n_per_year)))`. I modifed my answer accordingly, hope that helps! – Florian Jan 19 '18 at 10:49
  • Damn, I feel stupid now. Your solution seems nice, now I need to adapt it to my dataset. In my post I oversimplified but in reality my selection is composed as follows: for each year I have to select 50 samples of age 8 (for chronology) + 3 samples of age >10 (for isotopes). Because the archives have a high proportion of non usable samples (~40%) I always do a selection of 100+6 samples to have some margin. How could I implement these selection criteria using dplyr's syntax? – C.Den Jan 19 '18 at 12:02
  • No problem, I updated my answer, I hope that explains itself. You can easily adjust the conditions `age<8`, e.g. `age==8`. In the future however, please consider posing any additional adjustments to your question as a new question ;) Hope this helps! – Florian Jan 19 '18 at 12:35
  • Alright, will keep that in mind! Thanks for all these answers. – C.Den Jan 19 '18 at 12:47