I'm cleaning a dataset that doesn't yet have column names (so I'm working with indexes). I'm trying to filter two columns of a df by piping the result of the first filter into the second, and I don't understand why the code below doesn't work:

stripcols <- c("","Total+")

df <- df %>% 
  filter(!df[,1] %in% stripcols) %>% 
  filter(!df[,2] %in% stripcols)

Running this results in:

Error in filter_impl(.data, quo) : Result must have length 46, not 58

This is easily worked around by running the filter twice, but I don't understand why this didn't work.

I'm also curious as to whether there is a way to do this with one filter command that is applied on both columns rather than two.

ajbentley
  • Instead of `df[,1]` or `df[,2]`, you should be writing the name of the column. – joran Oct 03 '18 at 15:09
  • In addition to using names instead of column numbers, why not combine the two conditions, something like `df %>% filter(!first %in% stripcols & !second %in% stripcols)` – lebatsnok Oct 03 '18 at 15:17
  • At this point I don't have column names yet (still cleaning the data). I'll update. – ajbentley Oct 03 '18 at 15:22
  • The columns will have something as names, even if they're just automatic placeholders like X1, X2, etc. Beyond that, it's hard to help in more detail without a [reproducible question](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with a representative sample of your data – camille Oct 03 '18 at 15:24
  • Could the problem be the missing parentheses after !, i.e. !(df[1] %in% stripcols)? – karen Oct 03 '18 at 15:25
  • @camille Because I wasn't really looking for a solution to a problem, just curious if there was a reason, I didn't include any play data. I didn't realize that the automatic placeholders would work. I'm adding column names and will see if that changes anything. If not I'll add some data. – ajbentley Oct 03 '18 at 15:34

1 Answer

The source of the error is that df[,1] always refers to the original df, so every condition has nrow(df) elements regardless of how many rows actually reach the second filter. For instance:

dat <- data.frame(a=1:10)
dat %>% filter(a > 5)
#    a
# 1  6
# 2  7
# 3  8
# 4  9
# 5 10

The way you're writing it, you're doing

dat %>% filter(dat[,1] > 5)
#    a
# 1  6
# 2  7
# 3  8
# 4  9
# 5 10

For this first call, the number of rows that go into filter is 10, and the number of rows being compared inside filter is also 10. However, if you were to do:

dat %>% filter(dat[,1] > 5) %>% filter(dat[,1] > 7)
# Error in filter_impl(.data, quo) : Result must have length 5, not 10

this fails because the number of rows going into the second filter is only 5 not 10, though we are giving the filter command 10 comparisons by using dat[,1].
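The mismatch can be made visible by counting both sides directly. This is only a small sketch reusing the dat frame from above; the names step1, n_piped and n_condition are introduced here for illustration.

```r
library(dplyr)

dat <- data.frame(a = 1:10)

# After the first filter, only 5 rows remain in the piped data ...
step1 <- dat %>% filter(a > 5)
n_piped <- nrow(step1)

# ... but dat[, 1] still refers to the original, unfiltered frame,
# so the condition handed to the second filter has 10 elements.
n_condition <- length(dat[, 1] > 7)

n_piped      # 5
n_condition  # 10
```

The two numbers are exactly the pair reported in the error message.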

(N.B.: many comments about names are perfectly appropriate, but let's continue with the theme of using column indices.)

The first trick is to give each filter only as many comparisons as there are rows in the data coming in. Another way to say this is to do the comparisons on the state of the data at that point in the pipeline. magrittr (and therefore dplyr) does this with the . placeholder. The dot can always be inferred (by default it becomes the first argument of the RHS function, the function after %>%), but some feel that being explicit is better. For instance, this is legal:

mtcars %>%
  group_by(cyl) %>%
  tally()
# # A tibble: 3 x 2
#     cyl     n
#   <dbl> <int>
# 1     4    11
# 2     6     7
# 3     8    14

but an explicit equivalent pipe is this:

mtcars %>%
  group_by(., cyl) %>%
  tally(.)

If the first argument to the function is not the frame itself, then the %>% inferred way will fail:

mtcars %>%
  xtabs(~ cyl + vs)
# Error in as.data.frame.default(data, optional = TRUE) : 
#   cannot coerce class '"formula"' to a data.frame

(Because it is effectively calling xtabs(., ~cyl + vs), and without named arguments xtabs assumes the first argument is a formula.)

So we must be explicit in these situations:

mtcars %>%
  xtabs(~ cyl + vs, data = .)
#    vs
# cyl  0  1
#   4  1 10
#   6  3  4
#   8 14  0

(A contrived example, granted.) One could also do mtcars %>% xtabs(formula=~cyl+vs), but my point stands.

So to adapt your code, I would expect this to work:

df %>% 
  filter(!.[,1] %in% stripcols) %>% 
  filter(!.[,2] %in% stripcols)

I think I'd prefer the [[ approach (partly because I know that tbl_df and data.frame deal with [,1] slightly differently ... and though it works with it, I still prefer the explicitness of [[):

df %>% 
  filter(!.[[1]] %in% stripcols) %>% 
  filter(!.[[2]] %in% stripcols)
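As a rough illustration of the difference mentioned above, the following sketch compares [,1] and [[1]] on a plain data.frame and on a tibble. The sample values and the names plain and tb are made up for the example.

```r
library(tibble)

# Hypothetical one-column data, once as a data.frame and once as a tibble
plain <- data.frame(a = c("x", "", "Total+"), stringsAsFactors = FALSE)
tb <- as_tibble(plain)

is.data.frame(plain[, 1])  # FALSE: a bare character vector
is.data.frame(tb[, 1])     # TRUE: still a one-column tibble

# [[ returns the bare column vector in both cases
identical(plain[[1]], tb[[1]])  # TRUE
```

Because [[ behaves the same for both classes, it avoids having to remember which kind of frame is flowing through the pipe.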

which should work. Of course, combining works just fine, too:

df %>% 
  filter(!.[[1]] %in% stripcols, !.[[2]] %in% stripcols)
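Since the question includes no sample data, here is a small sketch on a made-up frame that compares the chained and the combined form. The data values and the names chained, combined and by_position are hypothetical. The last variant uses if_all(), which assumes dplyr 1.0.4 or later and is not part of the original answer.

```r
library(dplyr)

stripcols <- c("", "Total+")

# Hypothetical sample data; the real data set is not shown in the question
df <- data.frame(
  X1 = c("a", "", "b", "Total+", "c"),
  X2 = c("x", "y", "", "z", "w"),
  stringsAsFactors = FALSE
)

# Two filters, each evaluated against the data as it arrives
chained <- df %>%
  filter(!.[[1]] %in% stripcols) %>%
  filter(!.[[2]] %in% stripcols)

# One filter with both conditions
combined <- df %>%
  filter(!.[[1]] %in% stripcols, !.[[2]] %in% stripcols)

# Assumption: dplyr >= 1.0.4, where if_all() applies one test to several columns
by_position <- df %>%
  filter(if_all(1:2, ~ !.x %in% stripcols))

nrow(chained)  # 2
```

All three variants should keep the same two rows here (a/x and c/w); which one reads best is largely a matter of taste.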
r2evans