
I want to divide a data frame into subsets for training/testing. The data frame has columns that contain different items, and some items have sub-items (Aisle01, Aisle02, etc.). I am getting tripped up by filtering on a partial string in multiple columns.

Data sample:

Column1   Column2  Column3

Wall01    Wall04   45.6
Wall04    Aisle02  65.7
Aisle06   Wall01   45.0
Aisle01   Wall01   33.3
Wall01    Wall04   21.1

If my data frame (x) has two columns that each contain multiple versions of "Aisle", I wish to keep only the rows where either column contains "Aisle". Is the line below somewhat on the right track?

filter(x, column1 & column2 == grep(x$column1 & x$column2, "Aisle"))

Desired result:

Column1  Column2  Column3

Wall04   Aisle02  65.7
Aisle06  Wall01   45.0
Aisle01  Wall01   33.3

Thank you in advance.

oguz ismail
  • Please provide a reproducible example. You can use `dput` on your data set (or just the first few rows of it) so we can see what you are talking about, and then at the end show what your desired result would be on that data set. – Barker Oct 14 '16 at 00:01

1 Answer


The easiest solution I can see would be this:

x <- x[grepl("Aisle", x[["column1"]]) | grepl("Aisle", x[["column2"]]), ]

Using grepl instead of grep produces a logical vector, so you can use the | operator to combine the two conditions and select your rows. I also wanted to quickly go over a few places in your code that may be giving you trouble.

  1. The x$column1 & x$column2 at the start of your grep call asks R to apply the & operator element-wise to the entries of column1 and column2. Since these are characters (or factors) and not logicals, this will not do anything useful: R raises an error for character vectors and returns NA with a warning for factors.

  2. In grep the pattern you are trying to match comes before the string you are trying to match it to, so it should be grep("Aisle", columnValue) not the other way around. Running ?functionName will give you the information about the function so you don't have to try and figure that out from memory.

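  3. The filter in base R (stats::filter) is a function for time series (ts) objects, not data frames, so I am surprised you didn't get an error by using it in this way. If you meant filter from the dplyr package, it does work on data frames, but its condition still needs to be a logical vector such as the grepl expression above.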
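To make the points above concrete, here is a minimal sketch that rebuilds the sample table from the question and applies the grepl approach. The data frame is typed in by hand from the question's sample, the capitalised names Column1/Column2/Column3 follow the printed table rather than the lowercase names in the code, and the variable names keep and aisle_rows are only illustrative.

```r
# Sample data copied from the question (column names as printed in the table).
x <- data.frame(
  Column1 = c("Wall01", "Wall04", "Aisle06", "Aisle01", "Wall01"),
  Column2 = c("Wall04", "Aisle02", "Wall01", "Wall01", "Wall04"),
  Column3 = c(45.6, 65.7, 45.0, 33.3, 21.1),
  stringsAsFactors = FALSE
)

# grep returns the positions of the matching elements (pattern comes first).
grep("Aisle", x$Column1)

# grepl returns one TRUE/FALSE per element, which can be combined with |.
keep <- grepl("Aisle", x$Column1) | grepl("Aisle", x$Column2)

# Keep the rows where either column contains "Aisle".
aisle_rows <- x[keep, ]
print(aisle_rows)
```

Because keep has one entry per row, negating it with !keep would select the complementary rows instead, which may be handy for the other half of a split.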

Best of luck. Comment if you want anything clarified.

Barker
  • This is absolutely what I am looking for. If possible, could I create a column in the original data frame that will show a logical output like (1 or 0) if column1 and/or column2 contains that string? –  Oct 17 '16 at 21:43
  • `x[["isAisle"]] <- grepl("Aisle", x[["column1"]]) | grepl("Aisle", x[["column2"]])` – Barker Oct 18 '16 at 00:36
  • That is helpful, but all the values in "isAisle" are FALSE for me, whereas I am hoping to make the "Aisle" containing rows TRUE while any rows not containing "Aisle" false. Thank you again though! –  Oct 18 '16 at 04:13
  • It works for me. Check your spelling and check your names for the columns. The table you printed has the names as `ColumnX` with a capital "C" but your code sample had it as `columnX` with a lowercase. If you use `dput` to reproducibly output your data I can look more, but I can't debug anything based on what you provided. – Barker Oct 18 '16 at 17:34
  • x[["isAisle"]] <- grepl("Aisle", x[["Column1"]]) | grepl("Aisle", x[["Column2"]]) –  Oct 18 '16 at 20:10
  • The result of that is giving me the following: –  Oct 18 '16 at 20:14
  • Never mind, it was a syntax issue on my behalf. Your help has been nothing short of spectacular. Apologies if I broke any Stack Overflow etiquette in my question design. –  Oct 18 '16 at 21:11
  • No problem, in the future, following the guidelines in [this post](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) can help make it easier for people to answer your questions. – Barker Oct 19 '16 at 18:30
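For the follow-up in the comments (a 1/0 indicator column), a small sketch along the same lines might look like the following. It assumes the capitalised column names from the printed table, rebuilds the sample data by hand, and checks that the columns exist before using them, since a mismatch in names was the suspected problem in the thread. The names cols and other_rows are hypothetical.

```r
# Sample data copied from the question (column names as printed in the table).
x <- data.frame(
  Column1 = c("Wall01", "Wall04", "Aisle06", "Aisle01", "Wall01"),
  Column2 = c("Wall04", "Aisle02", "Wall01", "Wall01", "Wall04"),
  Column3 = c(45.6, 65.7, 45.0, 33.3, 21.1),
  stringsAsFactors = FALSE
)

# Guard against misspelled or mis-cased column names before matching.
cols <- c("Column1", "Column2")
stopifnot(all(cols %in% names(x)))

# TRUE/FALSE per row, converted to 1/0 with as.integer.
x[["isAisle"]] <- as.integer(
  grepl("Aisle", x[["Column1"]]) | grepl("Aisle", x[["Column2"]])
)

# The indicator can then drive a split, e.g. the rows without "Aisle".
other_rows <- x[x$isAisle == 0, ]
print(x)
```

Dropping the as.integer call keeps the column as TRUE/FALSE, which is what the one-liner in the comments produces.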