I have the BigQuery dataset of Reddit comments. It has multiple columns, one of which is the body column containing the actual comment text. I want to search the body column for a certain word, such as a brand mention like "BMW", and create a subset of the rows that contain "BMW" in data$body.

The dataset looks similar to this:

str(data)
data.frame: 75519 obs. of 113 variables
$ body: chr "...." .....
$ name: Factor w/ 22805 levels ....
....

I know the SQL command, which looks like this:

SELECT * FROM dataset
WHERE body contains "BMW"

Is there a similar command in R?

Thank you very much!

EDIT: The solution is

 bmw <- data[grep("BMW", data$body),]

Thanks to charleslmh

Arthur Pennt
  • Possible duplicate of [Test if characters in string in R](http://stackoverflow.com/questions/10128617/test-if-characters-in-string-in-r) – James Elderfield Jul 20 '16 at 09:03
  • I just tried grepl("BMW", data$body) which gives me just Boolean expressions. I would like to have the rows, containing "BMW" in data$body in a subset. Do you know how to do that? – Arthur Pennt Jul 20 '16 at 09:07
  • Can I use the numerical positions returned by grep to make a subset of the original data frame? In the end I want a new dataset in which the body column contains "BMW", with all the other columns of the original dataset. – Arthur Pennt Jul 20 '16 at 09:17
  • I guess `data[grep("BMW",data$body),]` could work. – Charleslmh Jul 20 '16 at 09:40
  • Great, that worked! Thank you!! – Arthur Pennt Jul 20 '16 at 10:37
  • If there's a solution, please post it as an answer. It's better for other users and the site in general. – catastrophic-failure Jul 20 '16 at 11:56
  • `grep` gives a (probably shorter) vector of the numerical positions of the matches. `grepl` gives a vector of TRUE and FALSE values of the same length as its 2nd argument. `grepl` is very useful when doing selections with `[` or `[[`. – IRTFM Jul 20 '16 at 16:45

2 Answers

The solution is

bmw <- data[grep("BMW", data$body),]

Thanks to charleslmh
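
As a minimal sketch of what that line does, here is the same call on a toy data frame. The `id` and `subreddit` columns and the sample comments are made up for illustration; only `body` comes from the question. It also shows two optional arguments of `grep`, `ignore.case` and `fixed`, which the original solution does not use.

```r
# Toy stand-in for the Reddit data; `id` and `subreddit` are hypothetical columns.
data <- data.frame(
  id        = 1:4,
  subreddit = c("cars", "cars", "autos", "cars"),
  body      = c("I love my BMW", "Audi is better", "bmw vs. Mercedes", "BMW M3 review"),
  stringsAsFactors = FALSE
)

# Rows whose body contains "BMW"; the empty slot after the comma keeps all columns.
bmw <- data[grep("BMW", data$body), ]

# grep() is case-sensitive by default; ignore.case = TRUE also catches "bmw".
bmw_any_case <- data[grep("BMW", data$body, ignore.case = TRUE), ]

# fixed = TRUE treats the pattern as a literal string instead of a regular expression.
bmw_literal <- data[grep("BMW", data$body, fixed = TRUE), ]
```

Whether case-insensitive matching is wanted depends on how the brand is usually written in the comments, so treat the last two calls as options rather than part of the accepted fix.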

Arthur Pennt

Either of these would succeed:

bmw <- data[ grep("BMW", data$body), ]  # numerical indexing
bmw <- data[ grepl("BMW", data$body), ] # logical indexing

The first one works because "[" accepts the numeric row positions that grep returns. The second one works because "[" also accepts a logical vector in the "i" (the first) position and selects the rows where it is TRUE.
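
One practical difference between the two forms appears when they are used to exclude rows and the pattern matches nothing. The sketch below uses a made-up data frame and a made-up pattern ("Tesla") to illustrate it; it is an illustration of general R indexing behaviour, not something taken from the question's data.

```r
# Toy data; the `id` column and the sample comments are hypothetical.
data <- data.frame(
  id   = 1:3,
  body = c("I love my BMW", "Audi is better", "BMW M3 review"),
  stringsAsFactors = FALSE
)

# No comment mentions "Tesla".
hits_num <- grep("Tesla", data$body)    # integer(0): no positions matched
hits_lgl <- grepl("Tesla", data$body)   # one FALSE per row

# Goal: keep every row that does NOT mention "Tesla".
drop_num <- data[-hits_num, ]   # -integer(0) is still integer(0), so no rows are selected
drop_lgl <- data[!hits_lgl, ]   # !FALSE is TRUE for every row, so all rows are kept
```

For plain positive selection, as in the question, both forms return the same rows; the logical form is simply the safer habit once negation is involved.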

IRTFM