Removing a URL or any recurring phrase from all rows of a data frame in R

Question

I have the following data frame named bbchealth:

head(bbchealth)
# A tibble: 6 x 1
  Tweets                                                    
  <chr>                                                     
1 Breast cancer risk test devised http://bbc.in/1CimpJF     
2 GP workload harming care - BMA poll http://bbc.in/1ChTBRv 
3 Short people's 'heart risk greater' http://bbc.in/1ChTANp 
4 New approach against HIV 'promising' http://bbc.in/1E6jAjt
5 Coalition 'undermined NHS' - doctors http://bbc.in/1CnLwK7
6 Review of case against NHS manager http://bbc.in/1Ffj6ci

As you can see, each row, which contains a single tweet, has a URL at the end. I would like to remove only this URL while leaving the rest of the data frame unaffected.

If I try to use something like rm_url, I get the following:

[1] "c(\"Breast cancer risk test devised \"GP workload harming care - BMA poll \"Short people's 'heart risk greater' \"New approach against HIV 'promising' \"Coalition 'undermined NHS' - doctors \"Review of case against NHS manager \"\\\"VIDEO: 'All day is empty, what am I going to do?' \"VIDEO: 'Overhaul needed' for end-of-life care \"Care for dying 'needs overhaul' \"VIDEO: NHS: Labour and Tory key policies \"Have GP services got worse? \"A&amp;E waiting hits new worst level \"Parties row over GP opening hours \"Why strenuous runs may not be so bad after all \"VIDEO: Health surcharge for non-EU patients \"VIDEO: Skin cancer spike 'from 60s holidays' \"\.........

That is, a single vector(?) consisting of a string of the tweets with the URLs removed.

The code I used was rm_url(bbchealth, replacement = "").

If I use gsub("http.*","",bbchealth), I get the following output:

[1] "c(\"Breast cancer risk test devised "

However, this is not what I want. I want to retain the columnar structure. That is,

# A tibble: 6 x 1
  Tweets                                                    
  <chr>                                                     
1 Breast cancer risk test devised  
2 GP workload harming care - BMA poll 
3 Short people's 'heart risk greater'  
4 New approach against HIV 'promising' 
5 Coalition 'undermined NHS' - doctors 
6 Review of case against NHS manager

How can I accomplish this?

Jonny Phelps · Answer 1 · 2018-11-12T15:00:23.863


Here you go, using the stringi package:

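# Sample data; stringsAsFactors = FALSE keeps Tweets as character in R < 4.0
dt <- data.frame(
  Tweets = c(
    "Breast cancer risk test devised http://bbc.in/1CimpJF ",
    "GP workload harming care - BMA poll http://bbc.in/1ChTBRv",
    "Short people's 'heart risk greater' http://bbc.in/1ChTANp "
  ),
  stringsAsFactors = FALSE
)

library(stringi)

# Remove the whitespace-prefixed URL through the end of each string and
# store the result as a new column in the same data frame
dt$Tweets2 <- stri_replace_all_regex(dt$Tweets, "\\shttp://.*$", "")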

edited Nov 12 '18 at 15:00

answered Nov 09 '18 at 16:06


This partially solves my problem. However, the output of the suggested code is a vector of strings: `[1] "Breast cancer risk test devised" "GP workload harming care - BMA poll" "Short people's 'heart risk greater'"` I can of course convert this into a data frame again. However, is there any function that will take each row of the original data frame, remove the URL, and then leave behind a similar data frame sans the URLs? – Anonymouse Nov 12 '18 at 00:03
Edited my solution. You can create new columns in a data frame, or even replace the Tweets column. I'd suggest reading some online tutorials about using data frames, e.g. https://www.tutorialspoint.com/r/r_data_frames.htm – Jonny Phelps Nov 12 '18 at 15:02
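As a minimal sketch of the "replace the Tweets column" option mentioned in the comment above, using base R only: the attempts in the question passed the whole data frame to gsub(), which coerces it with as.character() and so returns one deparsed string per column. Passing the column itself avoids that. The sample data below is a hypothetical stand-in for bbchealth (a plain data frame rather than a tibble), and the pattern is an assumption that each URL sits at the end of the tweet and starts with http:// or https://.

```r
# Hypothetical stand-in for the asker's bbchealth object
bbchealth <- data.frame(
  Tweets = c(
    "Breast cancer risk test devised http://bbc.in/1CimpJF",
    "GP workload harming care - BMA poll http://bbc.in/1ChTBRv",
    "Short people's 'heart risk greater' http://bbc.in/1ChTANp "
  ),
  stringsAsFactors = FALSE
)

# Operate on the column (a character vector), not on the data frame.
# The pattern matches optional leading whitespace, a trailing http/https URL,
# and any whitespace after it.
bbchealth$Tweets <- sub("\\s*https?://\\S+\\s*$", "", bbchealth$Tweets)

# Still a one-column data frame, now without the URLs
bbchealth
```

Assigning back to bbchealth$Tweets overwrites the column in place, so the object keeps its shape; assigning to a new name such as bbchealth$Tweets2 would keep the original text alongside the cleaned version.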
