0

I have a dataframe df with about 1000 rows with the following columns:

df$ID<- c("ab11", "ab12" ...) #about 1000 rows
df$ID1<-numbers ranging from 1 to 20k # for all intense and purposes this can be treated as class 'factor'
df$Acol<- #numbers ranging from 1 to 1000
df$Bcol<- #numbers ranging from 0 to 1

The following lists gives me 12 values in each list:

A<- seq(50,600,by=50)
B<- seq(0.2,1,by=0.75)

I am trying to do 2 things:

  1. I would like to create dataframes by filtering the original dataset with various combinations of lists A and B. So 144 dataframes.
  2. Once I have those dataframes I would like to combine 3 dataframes at a time and see if the frequency of the IDs match a master dataframe x and if they do, get the combinations information for the matching dataframe.

So for 1, this is my approach:

df_50_0.2<-subset(df, df$Acol>=50 & df$Bcol>=0.2)

I can't write that out 144 times- I need a loop. I tried nested loop but that doesn't give me every combination of A and B so I tried a while loop.

Here is my code:

i<-50
while (i<550) {
   for (j in B) {
      assign(paste("df","_",as.character(i),"_",as.character(j)), df %>% 
      filter (Acol>=i) %>% 
      filter(Bcol>=j),envir=.GlobalEnv
   i<-i+50
  }}   

That give me the desired result except it doesn't split the dataframe according to B. So the output is similar to what I would have if I had just filtered the data with values of A.

For the second part I need to loop through all possible combinations of three data frames at a time. Here is my code:

df.final<-rbind (df_50_0.2,df_100_0.25,df_150_0.5)
tmp<-subset(table(df.final$ID),!(table(df.final$ID) %in% table(x$ID))

I would like the above to be in a loop. If tmp has any values, I don't want it to be an output. If it is 0, that is, it is a perfect match to the frequency of IDs in the master dataframe x, I want that to be written. So something like the following in a loop? I want all possible combinations checked iteratively to come up with the combinations that match the master dataframe x ID frequency perfectly:

if tmp = NULL
tmp
else rm(tmp)

Any help is much appreciated. A python solution is also welcome!

A solution available in the following link but modified for two columns could be helpful Filter loop to create multiple data frames

ilearn
  • 1
  • 1
  • Could you provide a minimal reproducible example? Try to make a toy example that would help us understand what you want to do. – Joao Pedro Macalos Nov 07 '20 at 20:53
  • Are you trying to create overlapping data frames? The command `df_50_0.2<-subset(df, df$A>=50 & df$B>=0.2)` will include all of the observations. The next one would include all but the ones in both of the smaller groups of A and B and so on. – dcarlson Nov 07 '20 at 21:16
  • @dcarlson, Yes. The data frames will be overlapping – ilearn Nov 08 '20 at 21:58
  • @JoaoPedroMacalos, A solution available in the following link but modified for two columns could be helpful https://stackoverflow.com/questions/54679034/filter-loop-to-create-multiple-data-frames – ilearn Nov 09 '20 at 17:50

1 Answers1

1

This is not a complete answer, but it represents a different way to think about your problem. First we need a reproducible sample. I'm leaving out your first two columns, but it would not be difficult to modify the example to include them.

set.seed(42)
df <- data.frame(A=runif(200) * 1000 + 1, B=runif(200))
Aint <- seq(50, 600, by=50)
Bint <- seq(0.2, 1, by=0.05)
comb <- expand.grid(Aint=Aint, Bint=Bint)

This gives you a data frame of 200 observations and columns A and B. Then we create the intervals (called Aint and Bint so we don't confuse them with df$A and df$B. The next line creates all possible combinations of those intervals - 204 (not 144 since Bint has 17 values).

Now we need to define your groups and generate the label for each one:

groups <- apply(comb, 1, function(x)   df$A > x[1] & df$B > x[2])
labels <- paste("df", comb[, 1], comb[, 2], sep="_")

groups is a logical matrix 200 x 204 so each column defines one of your comparisons. The first group is defined as df[groups[, 1], ] and the second as df[groups[, 2], ]. The labels for those comparisons are labels[1] and labels[2].

Since your groups are overlapping, the next step will create multiple copies of your data in each group of three. The number of combinations of 204 groups taken 3 at a time is 1,394,204. The following generates the code to create the different combinations of 3 data frames:

allcombos <- combn(204, 3, simplify=TRUE)
for (i in 1:5) {
    dfno <- allcombos[, i]
    df.final <- rbind(df[groups[, dfno[1]], ], df[groups[, dfno[2]], ], df[groups[, dfno[3]], ])
    lbl <- labels[dfno]
    
}

The next step involves subset(table(df.final$ID),!(table(df.final$ID) %in% table(x$ID)) but that line does not do what you suggest. It creates a frequency table of the number of times each ID appears in df.final and a table for a master data frame x. The %in% statement processes each frequency in df.final and checks to see if it matches the frequency of ANY ID in master. So I have not included that logical statement or code that would write the current df.final to a list.

dcarlson
  • 10,936
  • 2
  • 15
  • 18
  • This is a very good approach - no need to duplicate the data and clutter the workspace. – Gregor Thomas Nov 09 '20 at 18:03
  • Thank you @dcarlson. This is powerful. Not sure how to proceed from here. I want to do df.final<-rbind(df[groups[, 1], ],df[groups[, 2], ],df[groups[, 3], ]) tmp<-subset(table(df.final$ID),!(table(df.final$ID) %in% table(x$ID)) if tmp = NULL tmp else rm(tmp) but iteratively for all possible combinations of 3 datasets. – ilearn Nov 09 '20 at 19:49
  • @dcarlson, I also modified my original question in hopes of providing clarity for the second part – ilearn Nov 09 '20 at 21:55
  • @dcarlson, Thanks again. What is allcombos in your for loop? So I am indeed trying to create a frequency table of the number of times each ID appears in df.final and a table for a master data frame x. Then using %in% statement process each frequency in df.final and check to see if it matches the frequency of ANY ID in master. If there is a complete match, (that is if i save it to tmp and tmp is empty), I would like to write out tmp. If they don't match I want tmp to be deleted. – ilearn Nov 10 '20 at 00:10
  • @dcarlson, if ‘groups’ that you created had 4 columns, I would do: df.final<-rbind(df[groups[, 1], ],df[groups[, 2], ],df[groups[, 3], ]) tmp<-subset(table(df.final$ID),!(table(df.final$ID) %in% table(x$ID)) tmp if tmp was empty, I will note this combination. if it had values that is, there were mismatches, i would delete tmp. rm(tmp) df.final<-rbind(df[groups[, 2], ],df[groups[, 3], ],df[groups[, 4], ]) tmp<-subset(table(df.final$ID),!(table(df.final$ID) %in% table(x$ID)) tmp rm(tmp) – ilearn Nov 10 '20 at 00:29
  • @dcarlson, putting it in two comments. too long for one. df.final<-rbind(df[groups[, 1], ],df[groups[, 2], ],df[groups[, 4], ]) tmp<-subset(table(df.final$ID),!(table(df.final$ID) %in% table(x$ID)) tmp rm(tmp) df.final<-rbind(df[groups[, 1], ],df[groups[, 3], ],df[groups[, 4], ]) tmp<-subset(table(df.final$ID),!(table(df.final$ID) %in% table(x$ID)) tmp rm(tmp) so how do I write a loop to not write this steps manually – ilearn Nov 10 '20 at 00:29
  • I've added the missing line that defines `allcombos`. Your other comments/questions are too hard to follow without formatting. Edit your original question to show this code and your questions. – dcarlson Nov 10 '20 at 04:11