
I am trying to scrape a website. So far I have downloaded the text and stored it in a data frame. I have the following:

keywords <- c(credit | model)

text_df <- as.data.frame.table(text_df)
text_df %>%
  filter(str_detect(text, keywords))

where credit and model are two words I want to search the website text for, i.e. return every row that contains the word credit or model.

I get the following error

Error in filter_impl(.data, dots) : object 'credit' not found

The code only returns the results with the word "model" in and ignores the word "credit".

How can I return all rows that contain either the word "credit" or the word "model"?

My plan is to have keywords <- c(credit | model | more_key_words | something_else | many values)

Thanks in advance.

EDIT:

text_df:
    Var 1    text
    1        Here is some credit information
    2        Some text which does not expalin any keywords but messy <li> text9182edj </i>
    3        This line may contain the keyword model
    4        another line which contains nothing of use

So I am trying to extract just rows 1 and 3.

user113156
  • Can't check it now, but `filter_()` should work – MikolajM Oct 05 '17 at 19:06
  • When asking for help you should provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output. Generally you need to search specific columns in data.frames for values, not the whole row so it would be better to be more specific here. – MrFlick Oct 05 '17 at 19:19
  • I have created a reduced example if that helps. – user113156 Oct 05 '17 at 19:41

3 Answers


I think the issue is that you need to pass a string as the pattern argument to str_detect. To check for "credit" or "model", you can paste them into a single string separated by |.

library(tidyverse)
library(stringr)
text_df <- read_table("Var 1    text
1        Here is some credit information
2        Some text which does not expalin any keywords but messy <li> text9182edj </i>
3        This line may contain the keyword model
4        another line which contains nothing of use")


keywords <- c("credit", "model")
any_word <- paste(keywords, collapse = "|") 
text_df %>% filter(str_detect(text, any_word))
#> # A tibble: 2 x 3
#>     Var   `1`                                    text
#>   <int> <chr>                                   <chr>
#> 1     1               Here is some credit information
#> 2     3       This line may contain the keyword model
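
If a keyword should only count when it appears as a whole word, the same collapse-with-| idea can be wrapped in word boundaries. The following is a minimal sketch in base R, not part of the code above: the sample data is rebuilt by hand with an assumed `id` column, `whole_word` and `matched` are illustrative names, and `grepl()` stands in for `str_detect()` so that no packages are needed.

```r
# Sample data rebuilt by hand (assumption: one id column and one text column).
text_df <- data.frame(
  id = 1:4,
  text = c(
    "Here is some credit information",
    "Some text which does not expalin any keywords but messy <li> text9182edj </i>",
    "This line may contain the keyword model",
    "another line which contains nothing of use"
  ),
  stringsAsFactors = FALSE
)

keywords <- c("credit", "model")

# Join the keywords with | and wrap the group in \b so only whole words match.
whole_word <- paste0("\\b(?:", paste(keywords, collapse = "|"), ")\\b")

# ignore.case = TRUE also catches "Credit" or "MODEL";
# perl = TRUE selects the PCRE engine, which supports the (?: ) group.
hits <- grepl(whole_word, text_df$text, ignore.case = TRUE, perl = TRUE)

matched <- text_df[hits, ]
print(matched)
```

Adding more keywords only means extending the `keywords` vector; the pattern is rebuilt from it each time.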
markdly
  • Thanks for the reply! I ran your code on my text file and it worked but the text file I have is significantly more messy than the one I put here so I had the correct results but also some extra noise in the output. (sorry this was my fault!) but it still worked. – user113156 Oct 05 '17 at 21:06
  • @user113156, I'm not exactly sure what you mean by extra noise in the output. You could be more strict with the search. For example `any_word <- paste0("\\b(?:", paste(keywords, collapse = "|"), ")\\b")` will I think only match if the keywords are a standalone word. – markdly Oct 05 '17 at 21:12
  • It gave me the right rows that had the keywords in but what I mean by extra noise is that it also gave me additional HTML output (in different rows) that did not have the requested keywords in, which I do not understand why... – user113156 Oct 05 '17 at 21:14

OK, I have checked it, and I think it will not work your way: the or operator | has to be used inside filter() to combine separate str_detect() calls, not inside the keywords vector that is passed to str_detect().

So it would work this way:

library(dplyr)
library(stringr)

keywords <- c("virg", "tos")

iris %>%
  filter(str_detect(Species, keywords[1]) | str_detect(Species, keywords[2]))

Each keyword in the vector has to be referenced explicitly, as keywords[1], keywords[2], and so on.
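
Writing out one condition per keyword gets tedious as the list grows. One possible way to generalise it is to build one logical vector per keyword and fold them together with an element-wise OR. The following is a minimal sketch in base R, with `grepl()` standing in for `str_detect()`; the names `per_keyword`, `keep`, and `result` are illustrative and not from the answer above.

```r
keywords <- c("virg", "tos")

# iris ships with base R; Species is a factor, so convert it to character first.
species <- as.character(iris$Species)

# One logical vector per keyword: TRUE where the species name contains it.
# fixed = TRUE treats each keyword as a literal string rather than a regex
# (an assumption that suits plain words).
per_keyword <- lapply(keywords, function(k) grepl(k, species, fixed = TRUE))

# Fold the vectors together with element-wise OR, for any number of keywords.
keep <- Reduce(`|`, per_keyword)

result <- iris[keep, ]
print(table(droplevels(result$Species)))
```

The resulting `keep` vector can also be used inside `filter()` in place of the hand-written chain of conditions.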

MikolajM
  • I think `iris %>% filter(str_detect(Species, paste(keywords, collapse = "|")))` will achieve the same result. – markdly Oct 05 '17 at 20:29
  • Thanks for your reply, I ran this version replacing the names to correspond to the names of my dataset and it gave pretty good results, it requires a little more work, specifying keywords[3], keywords[4], keywords[x] etc but it works. Thanks again! – user113156 Oct 05 '17 at 21:17

I would recommend staying away from regex when you're dealing with words. There are packages tailored to this particular task. Try, for example, the following:

library(corpus)
text <- readLines("http://norvig.com/big.txt") # sherlock holmes
terms <- c("watson", "sherlock holmes", "elementary")
text_locate(text, terms)
##    text           before               instance                after             
## 1  1    …Book of The Adventures of  Sherlock Holmes                             
## 2  27     Title: The Adventures of  Sherlock Holmes                             
## 3  40   … EBOOK, THE ADVENTURES OF  SHERLOCK HOLMES  ***                        
## 4  50                               SHERLOCK HOLMES                               
## 5  77                           To  Sherlock Holmes  she is always the woman. I…
## 6  85   …," he remarked. "I think,      Watson      , that you have put on seve…
## 7  89   …t a trifle more, I fancy,      Watson      . And in practice again, I …
## 8  145  …ere's money in this case,      Watson      , if there is nothing else.…
## 9  163  …friend and colleague, Dr.      Watson      , who is occasionally good …
## 10 315  … for you. And good-night,      Watson      ," he added, as the wheels …
## 11 352  …s quite too good to lose,      Watson      . I was just balancing whet…
## 12 422  …as I had pictured it from  Sherlock Holmes ' succinct description, but…
## 13 504         "Good-night, Mister  Sherlock Holmes ."                          
## 14 515  …t it!" he cried, grasping  Sherlock Holmes  by either shoulder and loo…
## 15 553                        "Mr.  Sherlock Holmes , I believe?" said she.     
## 16 559                     "What!"  Sherlock Holmes  staggered back, white with…
## 17 565  …tter was superscribed to " Sherlock Holmes , Esq. To be left till call…
## 18 567                "MY DEAR MR.  SHERLOCK HOLMES ,--You really did it very w…
## 19 569  …est to the celebrated Mr.  Sherlock Holmes . Then I, rather imprudentl…
## 20 571  …s; and I remain, dear Mr.  Sherlock Holmes ,                           
## ⋮  (189 rows total)

Note that this matches the term regardless of the case.

For your specific use case, do

ix <- text_detect(text, terms)

or

matches <- text_subset(text, terms)
Patrick Perry