
I have a data frame (df) of numeric values. I would like to write a for loop that iterates through the columns. For each column, it should count the rows whose value is at or above a threshold, say 3, and then delete those rows entirely before moving on to the next column.

This is what I tried so far:


output <- vector("double", ncol(df))
for (i in 1:ncol(df)) {
  output[[i]] <- length(which(df[i] >= 3))
  df <- df[!df[, i] >= 3, ]
}

But I get the following error:

Error in matrix(if (is.null(value)) logical() else value, nrow = nr, dimnames = list(rn, : length of 'dimnames' [2] not equal to array extent


dput(head(df))

#output:
structure(list(col1 = numeric(0), col2 = numeric(0), (etc.)
NA. = integer(0)), row.names = integer(0), class = "data.frame")

  col1   col2   col3   col4     col5
1 2.09   1.10    0     21.03    0.88
3 0.00   0.00    0     11.71    0.00
4 1.50   1.10    0     1.67     1.76
5 5.10   0.00    0     0.83     17.94
6 0.00   6.34    0     2.10     0.00

In the example above, the final output I am interested in is a vector with the number of rows deleted per column: (1,1,0,2,0).

sf1

2 Answers


Here's a way with a for loop -

dummy_df <- df # work on a copy so the original df is left unchanged
output <- rep(0, ncol(df)) # one count per column, initialized to 0

for (i in seq_len(ncol(df))) {
  if (nrow(dummy_df) == 0) break # stop once every row has been removed
  if (!any(dummy_df >= 3)) break # stop once no value >= 3 remains
  output[i] <- sum(dummy_df[[i]] >= 3) # rows removed because of column i
  dummy_df <- dummy_df[dummy_df[[i]] < 3, , drop = FALSE]
}

output
[1] 3 0 1
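
As a cross-check, here is a minimal sketch of the same idea run on the sample rows from the question. It pulls each column out with `[[i]]`, which gives a plain vector, so the comparison and the row filter also work once no rows are left and no early `break` is needed. The names `q_df`, `threshold`, `remaining`, and `removed_per_col` are made up for this illustration, and the row labels from the question are omitted.

```r
# Sample rows from the question (row labels omitted)
q_df <- data.frame(
  col1 = c(2.09, 0.00, 1.50, 5.10, 0.00),
  col2 = c(1.10, 0.00, 1.10, 0.00, 6.34),
  col3 = c(0, 0, 0, 0, 0),
  col4 = c(21.03, 11.71, 1.67, 0.83, 2.10),
  col5 = c(0.88, 0.00, 1.76, 17.94, 0.00)
)

threshold <- 3                        # assumed cutoff, as in the question
remaining <- q_df                     # copy that shrinks as rows are removed
removed_per_col <- integer(ncol(q_df))

for (i in seq_along(q_df)) {
  hit <- remaining[[i]] >= threshold  # plain logical vector, may be empty
  removed_per_col[i] <- sum(hit)      # rows removed because of column i
  remaining <- remaining[!hit, , drop = FALSE]
}

removed_per_col
```

On these rows the counts should match the vector the question asks for, (1,1,0,2,0), with one row left in `remaining`.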

Another way, using apply, which is probably faster than the loop above -

# the table omits columns with a count of 0; they can be added back later if needed
table(apply(df, 1, function(x) match(TRUE, x >= 3)))
1 3 
3 1
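
If the zero-count columns are wanted in the result, one possible way to add them back is to turn the first-match positions into a factor with one level per column before tabulating. This is only a sketch on the sample data from this answer; `first_hit` and `counts` are illustrative names.

```r
df <- data.frame(a = 1:5, b = c(1, 2, 3, 10, 2), c = c(1, 5, 2, 1, 1))

# Position of the first column with a value >= 3 in each row (NA if none)
first_hit <- apply(df, 1, function(x) match(TRUE, x >= 3))

# One factor level per column, so columns with no hits are counted as 0;
# table() leaves out the NA entries by default
counts <- table(factor(first_hit, levels = seq_len(ncol(df))))
counts <- setNames(as.vector(counts), names(df))
counts
```

For this data the result should line up with the loop output above: 3 for `a`, 0 for `b`, and 1 for `c`.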

Data (Thanks to @Sada93) -

  a  b c
1 1  1 1
2 2  2 5
3 3  3 2
4 4 10 1
5 5  2 1
Shree
  • That's not quite what I am trying to do. The main information I need is the number of UNIQUE rows with values above 3 for each column. So using the test data provided, for the first column I need to know that 3 rows had values of 3 and above. Then when I go to the next column, I do not want to consider the rows already counted in column 1 which is why I want to delete them. So the output I want is a vector looking like (3, 0, 1) – sf1 Sep 06 '19 at 21:33
  • Thank you! I get an output but it still gives me the same error message, not sure why. The output in the last few points are still NA, not sure if that means there were no rows left for it to check? – sf1 Sep 09 '19 at 16:03
  • Thanks for your help. I still got the same error with the modification – sf1 Sep 09 '19 at 18:50
  • @sf1 Add the output of `dput(head(df))` to your post. Hard to help without looking at the data. – Shree Sep 09 '19 at 19:09
  • @sf1 There's no data in it?! `df` should be your dataframe that you are testing. – Shree Sep 09 '19 at 19:55
  • Sorry, when I type what you wrote, it doesn't actually display the data, just all the column names saying they are numeric. It is a large data frame, about 1000 columns and 5000 rows, all numeric values between 0 and 500. – sf1 Sep 10 '19 at 21:43
  • @sf1 Just make small dataframe that is similar to your actual dataframe and add it to your post. See [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Shree Sep 10 '19 at 21:48
  • @sf1 You just need to initialize `output` using `0` instead of `NA`. See updated answer. – Shree Sep 11 '19 at 18:13
  • It still gives me the same error. But I figured out the issue! At some point the data frame becomes empty because all the rows have been deleted. So I added this line 'if(nrow(dummy_df) == 0) break' before the other break clause in your post and it fixed the issue. Thanks so much for your help and being patient! – sf1 Sep 11 '19 at 18:51

You could do:

Data:
df <- data.frame(x=c(1:5,2),y=c(1,1,1,4,5,2), z= c(2,1,1,2,5,2))

Code:

removed.df <- NULL
for (i in 1:ncol(df)){
  for(j in 1:nrow(df)){
    if(df[j,i] > 3){
      tmp.df <- df[j,]
      tmp.df$index <- j
      removed.df <- rbind(removed.df, tmp.df)
    }
  }
}

# removed.df holds the rows to delete; the index column records their original row numbers
removed.df <- removed.df[!duplicated(removed.df$index),]

# now remove those rows (the index values in removed.df) from df
df[-removed.df$index,]

> df[-removed.df$index,]
  x y z
1 1 1 2
2 2 1 1
3 3 1 1
6 2 2 2
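
The same filtering can probably be written without the nested loops. The sketch below assumes the strict `> 3` comparison used in this answer and the same sample data; `over`, `drop_row`, and `kept` are names introduced here for illustration.

```r
df <- data.frame(x = c(1:5, 2), y = c(1, 1, 1, 4, 5, 2), z = c(2, 1, 1, 2, 5, 2))

over <- df > 3                  # logical matrix, one cell per value
drop_row <- rowSums(over) > 0   # TRUE where any column exceeds 3
kept <- df[!drop_row, , drop = FALSE]
kept
```

Because it filters with a logical vector, this form also keeps every row when nothing exceeds the threshold, whereas negative indexing with an empty index selects no rows at all.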
MAPK