I have a column in a data frame, old_df.

A sample row looks like:

data
trying URL 'https://maps.googleapis.com/maps/api/streetview?&location=13.5146367326733,100.380686367492&size=8000x5333&heading=0&fov=90&pitch=0&key='Content type 'image/jpeg' length 59782 bytes (58 KB)
downloaded 58 KB

Using stopwords, I have removed the words I do not want, and am left with:

data
?&13.5146367326733,100.380686367492
?&13.5162026732673,100.66581378616

stopwords = c('trying',
          'URL', 
          "'",
          '&',
          'location=',
          'https://maps.googleapis.com/maps/api/streetview',
          'size=8000x5333',
          'heading',
          '=0&fov=90&pitch=0&key=',
          'Content', 
          'type',
          'image/jpeg',
          'length', 
          'bytes',
          'KB')

library(tm)
new_df <- as.data.frame(removeWords(old_df$data, stopwords))

However, ?& still remains in the data column in front of the numbers, which I don't want. I tried adding ?, & and ?& to stopwords, but they are not removed. Any ideas how to delete them?

In fact, when I add those combinations to stopwords, I get this error:

PCRE pattern compilation error 'quantifier does not follow a repeatable item' at '?|&|')\b'

1 Answer

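Use gsub(). removeWords() only removes whole "words", i.e. text delimited by word boundaries (\b), and it pastes the stopwords into a regular expression without escaping them, which is why a bare ? triggers the PCRE error.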
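
To see why the stopword approach cannot work here, the following is a minimal sketch in base R only. The pattern is an approximation inferred from the `\b` and `|` visible in the error message, not tm's exact internals; the variable names `words`, `pattern`, `compile_msg` and `escaped`, and the shortened test string, are hypothetical.

```r
# Stopwords including a bare "?" (a regex quantifier) and "&"
words <- c("trying", "?", "&")

# Approximation of the pattern implied by the error message (an assumption):
# the words are joined with "|" and wrapped in \b ... \b, unescaped
pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))

# Compiling the pattern fails because "?" has nothing to repeat;
# the error message is captured instead of stopping the script
compile_msg <- tryCatch(
  suppressWarnings(gsub(pattern, "", "?&13.5", perl = TRUE)),
  error = function(e) conditionMessage(e)
)

# Even with "?" escaped, \b only matches between a word character and a
# non-word character, so a leading "?&" is never matched as a "word"
escaped <- gsub("\\b(\\?|&)\\b", "", "?&13.5", perl = TRUE)
```

Here `escaped` comes back unchanged, which suggests that escaping the stopwords alone would not be enough; a plain substitution is the simpler route.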

Base R solution:

gsub("^\\?&", "", old_df$data)

stringr solution:

library(stringr)
stringr::str_remove(old_df$data, "^\\?&")
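
If the anchored pattern does not match, one possible cause (an assumption, since the real values are not shown) is leading whitespace left behind where words were removed, which keeps `^` from lining up with `?&`. A hedged sketch of two alternatives follows: a literal match with `fixed = TRUE`, and extracting the coordinates straight from the raw message. The vector `x` and the names `cleaned`, `raw` and `coords` are hypothetical.

```r
# Hypothetical leftover values; the leading spaces are an assumption
x <- c("  ?&13.5146367326733,100.380686367492",
       "?&13.5162026732673,100.66581378616")

# fixed = TRUE treats "?&" as literal text, so no escaping is needed;
# trimws() then drops any surrounding whitespace
cleaned <- trimws(gsub("?&", "", x, fixed = TRUE))

# Alternative: skip the stopwords and capture "lat,lon" after "location="
raw <- "trying URL 'https://maps.googleapis.com/maps/api/streetview?&location=13.5146367326733,100.380686367492&size=8000x5333&heading=0&fov=90&pitch=0&key='Content type 'image/jpeg' length 59782 bytes (58 KB)"
coords <- sub(".*location=(-?[0-9.]+,-?[0-9.]+).*", "\\1", raw)
```

Note that `sub()` returns its input unchanged when the pattern does not match, so rows without a `location=` parameter would need a separate check.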
tacoman
  • Thanks, but weirdly these solutions don't delete `?&` in my string. But, using `gsub('?&', '', old_df$data)` does return `?13.5146367326733,100.380686367492`. So now we just need the `?` to go, but again I'm not sure why/how this remains. – HelpMePlease Jun 29 '21 at 15:05
  • You need to escape the question mark with `\\?`. See my solution. – tacoman Jun 30 '21 at 06:23