So, this is a rookie question.
I've got a data frame containing all responses to an online survey, 89 columns in total.
Since online surveys are sometimes filled out by people who don't really care and simply choose response values that are easy to click through, I would like to filter out implausible rows where someone simply hit the same extreme value over and over again.
I would like to filter out rows where these columns ALL have the value "9", or ALL have the value "1":
- 'sociald_ties_strong'
- 'sociald_ties_weak'
- 'sociald_ties_secondorder'
- 'sociald_identity_lifestyle'
- 'sociald_identity_politics'
- 'sociald_vertical_socialcapital'
- 'sociald_vertical_networkcapital'
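To make the goal concrete, here is a small made-up example (only three of the seven columns, for brevity; the data values are invented) showing which rows I consider implausible:

```r
library(dplyr)

# Toy data mimicking the survey structure (column names from my real data,
# values made up for illustration)
toy <- tibble(
  sociald_ties_strong      = c(9, 1, 5, 2),
  sociald_ties_weak        = c(9, 1, 3, 9),
  sociald_ties_secondorder = c(9, 1, 7, 1)
)

# Row 1 (all 9s) and row 2 (all 1s) are the straight-liners I want to drop;
# rows 3 and 4 have mixed answers and should be kept.
```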
So this is my code (with the tidyverse and dplyr packages loaded):
data_cleaned <- data_raw %>%
  filter(sociald_ties_strong == 9 &
         sociald_ties_weak == 9 &
         sociald_ties_secondorder == 9 &
         sociald_identity_lifestyle == 9 &
         sociald_identity_politics == 9 &
         sociald_vertical_socialcapital == 9 &
         sociald_vertical_networkcapital == 9) %>%
  filter(sociald_ties_strong == 1 &
         sociald_ties_weak == 1 &
         sociald_ties_secondorder == 1 &
         sociald_identity_lifestyle == 1 &
         sociald_identity_politics == 1 &
         sociald_vertical_socialcapital == 1 &
         sociald_vertical_networkcapital == 1)
But I seem to be missing something in my logic and/or syntax, as this filters out way too many rows.
My data cleaning will include more conditions like the ones above to exclude faulty or automated rows, but first I want to learn how to get this one right.
Maybe piping two (or more) filters together the way I did is not a good idea? Any suggestions welcome!
I expected to filter out only the few rows that meet all the conditions, perhaps 0.5% of the total observations.