I have been trying to solve this issue without success; the solutions I found on the internet and on this site didn't work either.

I have datasets of this kind, each with more than 500k rows.

Example subset:

subset <- as.data.frame(matrix(c(9,9,9,0,2,9,0,9,9,1,0,2,9,9,9,0,0,0,2,2,2,1,1,1), ncol = 3, byrow = TRUE))

Every column is an individual and every row is a marker. The values "0", "1", and "2" mean the data is present for that row (they carry other meanings too, but those are not needed here), and "9" means the data is missing for that row. I write the numbers in quotation marks to make them easier to spot, but they are numeric in the dataset.

What I am trying to do is count the rows where at least one of the samples is not missing. So for rows that consist entirely of "9"s, the counter should not increase. If at least one cell in a row is not 9, the counter should increase by one.

After trying for some time, I wrote this code:

counter=0

test = apply(subset, 1,  function(i) {
  if(length(which(subset[i,] !=9)) != 0){
    counter=counter+1
  }
  print(counter)
  assign("counter",counter,envir = .GlobalEnv)
})

When I do this, the counter doesn't increase when the only cell or cells that are not "9" come back as integer(0). For example, in the picture I uploaded, the 9th row consists of many "9"s and an integer(0). The counter won't increase for this row, but I need to count it too.
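
One possible reading of this behaviour, sketched on the example subset: apply() hands the function the values of each row, not the row number, so subset[i, ] ends up using the marker values as row indices. A zero used as a row index selects nothing, which may be where the integer(0) results come from. The name row_values is only illustrative.

```r
subset <- as.data.frame(matrix(c(9,9,9,0,2,9,0,9,9,1,0,2,9,9,9,0,0,0,2,2,2,1,1,1), ncol = 3, byrow = TRUE))

# What apply() would pass as `i` for the 6th row: the values 0, 0, 0
row_values <- unlist(subset[6, ])

# Using those values as row indices selects zero rows,
# so nothing is left to compare against 9
nrow(subset[row_values, ])
#> [1] 0

# Testing the row values directly avoids the extra indexing
any(row_values != 9)
#> [1] TRUE
```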

To overcome this, I tried different things, including:

1- Placing identical(length(which(dummy[i,] ==0)), integer(0)) and all() in various places in the loop, and trying various if/else statements. I also tried other ways of counting integer(0) that I no longer remember.

2- Changing the 9s into NA, or changing the integer(0)s into another number such as 3. Both changed the behaviour of the loop: the counter now increases by one regardless of the cells in the row.

3- Using the if conditional with ( condition < 9*ncol(subset) ), which I thought would work (if any cell is not missing/9, the row total will be less than 9*ncol), but again R sees it as integer(0) and nothing changes.

4- Trying to find where the result is "zero" won't work, because the code I wrote at the beginning gives the same result (zero) for the missing data "9"s as well. I only want the missing rows left out of the counter.

If anybody can help with this issue, it will be highly appreciated. As Stack Overflow wants to keep the comment section free of thank-you messages, I want to say thanks to everybody in advance.

Yohussub
  • Please provide a [minimum reproducible dataset](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Adam Quek May 29 '22 at 09:19
  • Do you want a zero if any of the cells in a row contains a nine and a one if they don't, OR do you want to count the number of all non-nine values? – Jakub Jędrusiak May 29 '22 at 09:33
  • Thank you both, I added a small code for an example subset, I am not a professional so maybe it could be a better one but this works. – Yohussub May 29 '22 at 09:44
  • I want to count the rows with at least one non-nine value. – Yohussub May 29 '22 at 09:44

2 Answers

As I understand it, you want to count the number of rows where there is at least one value different from 9. There are many ways of doing this; below are two alternatives.

With dplyr

You can do this with dplyr like this:

library(dplyr)

# Your provided data
subset %>% 
  filter(if_any(everything(), ~ .x != 9)) %>% 
  nrow()
#> [1] 6

Created on 2022-05-29 by the reprex package (v2.0.1)

In filter(if_any(everything(), ~ .x != 9)), filter() keeps the rows where at least one value is not equal to 9 and drops the rows that consist only of 9s. Afterwards, we just count the remaining rows.

With apply()

If you want to use apply you can do the following:

sum(
  apply(
    subset, 
    MARGIN = 1, 
    function(x) {
      any(x != 9)
    }
  )
)
#> [1] 6

Here, I iterate over each row of subset with apply() and check whether any value in that row is not equal to 9. This returns a vector of TRUE/FALSE. We sum() this vector to find the total number of rows with at least one value different from 9.
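
A vectorised alternative, as a sketch only: assuming all columns are numeric as in the example, rowSums() on the logical matrix subset != 9 gives the same count without calling a function once per row, which may help with 500k+ rows. Note that rowSums() is not part of the apply family, so it may not fit the assignment. The names present and present_per_row are illustrative.

```r
subset <- as.data.frame(matrix(c(9,9,9,0,2,9,0,9,9,1,0,2,9,9,9,0,0,0,2,2,2,1,1,1), ncol = 3, byrow = TRUE))

# TRUE where a cell holds real data, FALSE where it holds the missing code 9
present <- subset != 9

# Number of non-missing cells in each row
present_per_row <- rowSums(present)

# Count the rows that have at least one non-missing cell
sum(present_per_row > 0)
#> [1] 6
```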

jpiversen
  • Thank you, but it seems the if_any function doesn't work with my dplyr, and I don't know why. My assignment also asked for doing it with the apply family. Do you have any suggestions for either of these? – Yohussub May 29 '22 at 11:07
  • `if_any()` will probably work if you update dplyr. Just install the package again with `install.packages('dplyr')`. I have added an `apply()` alternative to my answer. – jpiversen May 29 '22 at 11:33
  • I updated the packages and the filtering worked after that. Thank you for explaining how to do it both ways. – Yohussub May 29 '22 at 12:05

This is the option I find the easiest to understand. You can create an additional column, counter, whose value is based on the other variables. The case_when function checks the values of your columns: if it finds a 9 in any of them, it puts a 0 in the counter column; if it finds no 9 in any of your columns, it returns a 1. You can then sum the counter column to get the overall number of rows without nines.

library(dplyr)
subset <- as.data.frame(matrix(c(9, 9, 9, 0, 2, 9, 0, 9, 9, 1, 0, 2, 9, 9, 9, 0, 0, 0, 2, 2, 2, 1, 1, 1), ncol = 3, byrow = TRUE))
subset <- subset %>%
  mutate(counter = case_when(
    V1 == 9 ~ 0,
    V2 == 9 ~ 0,
    V3 == 9 ~ 0,
    TRUE ~ 1
  ))
number_of_full_rows <- sum(subset$counter)

If you're sure you understand the basic version, you can shorten it so you don't have to name all of your columns.

library(dplyr)
subset <- as.data.frame(matrix(c(9, 9, 9, 0, 2, 9, 0, 9, 9, 1, 0, 2, 9, 9, 9, 0, 0, 0, 2, 2, 2, 1, 1, 1), ncol = 3, byrow = TRUE))
subset <- subset %>%
  mutate(counter = case_when(
    if_any(everything(), ~ .x == 9) ~ 0,
    TRUE ~ 1
  ))
number_of_full_rows <- sum(subset$counter)
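
Note that both versions above count the rows that contain no 9 at all, which gives 4 for the example data. If the goal is rows with at least one non-9 value, as stated in the comments on the question, one possible adjustment is to use if_all() so that only rows made up entirely of 9s get a 0. This is a sketch, assuming a dplyr version that provides if_all(); counted and rows_with_data are illustrative names.

```r
library(dplyr)

subset <- as.data.frame(matrix(c(9, 9, 9, 0, 2, 9, 0, 9, 9, 1, 0, 2, 9, 9, 9, 0, 0, 0, 2, 2, 2, 1, 1, 1), ncol = 3, byrow = TRUE))

counted <- subset %>%
  mutate(counter = case_when(
    # rows where every column holds the missing code 9 are not counted
    if_all(everything(), ~ .x == 9) ~ 0,
    # any other row has at least one non-missing value
    TRUE ~ 1
  ))

rows_with_data <- sum(counted$counter)
rows_with_data
#> [1] 6
```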
Jakub Jędrusiak
  • Thank you, but it seems the if_any function doesn't work with my dplyr, and I don't know why. My assignment also asked for doing it with the apply family. Do you have any suggestions for either of these? – Yohussub May 29 '22 at 11:07
  • @Yohussub Try running `update.packages()` before running the code. If it still doesn't work, use the first variant I've proposed. – Jakub Jędrusiak May 29 '22 at 11:29
  • It worked after I updated the packages! Many thanks again. – Yohussub May 29 '22 at 12:02