I have a dataframe with multiple columns containing, inter alia, words and their position in sentences. For some positions, there's more rows than for other positions. Here's a mock example:
df <- data.frame(
word = sample(LETTERS, 100, replace = T),
position = sample(1:5, 100, replace = T)
)
head(df)
word position
1 K 1
2 R 5
3 J 2
4 Y 5
5 Z 5
6 U 4
Obviously, the tranches of 'position' are differently sized:
table(df$position)
1 2 3 4 5
15 15 17 28 25
To make the different tranches more easily comparable I'd like to draw equally sized samples on the variable 'position' within one dataframe. This can theoretically be done in steps, such as these:
df_pos1 <- df[df$position==1,]
df_pos1_sample <- df_pos1[sample(1:nrow(df_pos1), 3),]
df_pos2 <- df[df$position==2,]
df_pos2_sample <- df_pos2[sample(1:nrow(df_pos2), 3),]
df_pos3 <- df[df$position==3,]
df_pos3_sample <- df_pos3[sample(1:nrow(df_pos3), 3),]
df_pos4 <- df[df$position==4,]
df_pos4_sample <- df_pos4[sample(1:nrow(df_pos4), 3),]
df_pos5 <- df[df$position==5,]
df_pos5_sample <- df_pos5[sample(1:nrow(df_pos5), 3),]
and so on, to finally combine the individual samples in a single dataframe:
df_samples <- rbind(df_pos1_sample, df_pos2_sample, df_pos3_sample, df_pos4_sample, df_pos5_sample)
but this procedure is cumbersome and error-prone. A more economical solution might be a for loop. I've tried this code so far, which, however, returns, not a combination of the individual samples for each position value but a single sample drawn from all values for 'position':
df_samples <-c()
for(i in unique(df$position)){
df_samples <- rbind(df[sample(1:nrow(df[df$position==i,]), 3),])
}
df_samples
word position
13 D 2
2 R 5
12 G 3
4 Y 5
16 Z 3
11 S 3
6 U 4
14 J 3
9 O 5
1 K 1
What's wrong with this code and how can it be improved?