I have a dataset in which the columns are survey questions and the values in the rows contain the answer selected by the respondent as well as multiple HTML tags. I'm trying to remove all the HTML tags to be left with just the answer text.

In Excel, this could be accomplished with a find-and-replace on <*> with an empty string as the replacement. I can't figure out how to do this in R, though: I can't get the wildcard to stop at the first closing angle bracket (>). Instead, the match treats that bracket as part of the wildcard and continues to the end of the string. I've included a toy dataset and my attempt below.

temp <- data.frame(one = c('<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">Answer 1</span></b>',
                         '<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">Answer 2</span></b>',
                         '<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">Answer 3</span></b>'),
                   two = c('<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">apples are red</span></b>',
                         '<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">apples are blue</span></b>',
                         '<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">apples are bananas</span></b>'))


temp[] <- sapply(temp, function(x) gsub('<.*>+', "", x))

# what I want the new temp to look like (the code above results in empty strings)
data.frame(one = c("Answer 1",
                   "Answer 2",
                   "Answer 3"),
           two = c("apples are red",
                   "apples are blue",
                   "apples are bananas"))

I tried the code for matching the nth occurrence and a few other approaches, but the match still continues past the first instance to the end of the string.

What regex syntax am I missing to make the match terminate at the first instance? Also, I assume it will move on to the next row after completing that first removal, forcing me to run gsub() n times, where n is the maximum number of tags in any given cell. That's not particularly problematic, but is there a workaround?

cparmstrong

2 Answers

Check out this excerpt from the regex documentation:

By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)

temp[] <- sapply(temp, function(x) gsub('<.*?>', "", x))

       one                two
1 Answer 1     apples are red
2 Answer 2    apples are blue
3 Answer 3 apples are bananas
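
To make the greedy/lazy distinction from the documentation excerpt concrete, here is a minimal sketch on a single string. The sample string is a shortened, hypothetical version of the question's data, and the negated character class `<[^>]*>` is an alternative pattern that is not part of the answer above.

```r
# Hypothetical sample in the same shape as the question's data
x <- '<b style="font-weight: normal;"><span style="font-size: 12pt;">Answer 1</span></b>'

# Greedy: .* runs from the first "<" to the LAST ">", so the whole string matches
greedy <- gsub("<.*>", "", x)

# Lazy: .*? stops at the NEAREST ">", so each tag is matched separately
lazy <- gsub("<.*?>", "", x)

# Alternative (an assumption, not from the answer): a negated character
# class can never cross a ">", so it needs no lazy quantifier
negated <- gsub("<[^>]*>", "", x)

print(greedy)   # ""
print(lazy)     # "Answer 1"
print(negated)  # "Answer 1"
```

The lazy and negated-class patterns give the same result here. The negated class is sometimes preferred because it does not depend on how a given regex engine handles lazy quantifiers.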

To answer your second concern: gsub replaces all matches in each string (as opposed to sub, which replaces only the first match), so a single call is enough.
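
As a quick illustration of the sub/gsub difference, the following sketch uses hypothetical sample strings (not the question's data) and shows that one gsub call handles every tag in every element of a vector:

```r
# Hypothetical sample string with four tags
x <- "<b><span>Answer 1</span></b>"

first_only <- sub("<.*?>", "", x)   # removes only the first tag
all_tags   <- gsub("<.*?>", "", x)  # removes every tag in one call

print(first_only)  # "<span>Answer 1</span></b>"
print(all_tags)    # "Answer 1"

# Both functions are vectorized, so rows with different numbers of tags
# are all cleaned by the same single call
rows <- c("<b>Answer 1</b>", "<i><b>Answer 2</b></i>")
cleaned_rows <- gsub("<.*?>", "", rows)
print(cleaned_rows)  # "Answer 1" "Answer 2"
```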

zack
  • Why use sapply? `gsub` is vectorized and preserves the dimensions of its input, so `gsub('<.*?>', "", as.matrix(temp))` is enough. – Onyambu Sep 08 '18 at 00:36
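
A minimal sketch of the point made in the comment, using a small hypothetical data frame `df` rather than the question's `temp`. Option B, with `lapply`, is an additional assumption for the case where the result should remain a data frame; it is not something the comment proposes.

```r
# Small hypothetical data frame with tagged text
df <- data.frame(one = c("<b>Answer 1</b>", "<b>Answer 2</b>"),
                 two = c("<i>apples are red</i>", "<i>apples are blue</i>"),
                 stringsAsFactors = FALSE)

# Option A (from the comment): character matrix in, character matrix out
m <- gsub("<.*?>", "", as.matrix(df))

# Option B (assumption): assign into df[] so the data frame structure is kept
cleaned <- df
cleaned[] <- lapply(df, gsub, pattern = "<.*?>", replacement = "")

print(dim(m))   # 2 2
print(cleaned)
```

Option A returns a matrix, so wrap it in `as.data.frame()` if a data frame is needed downstream.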

With str_extract, we can extract word characters and spaces between > and <:

library(stringr)
library(dplyr)

temp %>%
  mutate_all(str_extract, "(?<=\\>)[\\w\\s]+(?=\\<)")

Output:

       one                two
1 Answer 1     apples are red
2 Answer 2    apples are blue
3 Answer 3 apples are bananas
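
One caveat with the extraction approach: `[\\w\\s]+` matches only word characters and whitespace, so an answer containing punctuation would not be extracted. The sketch below shows this using a base R equivalent of the same lookaround pattern (`regexpr` with `perl = TRUE`, chosen here only to avoid package dependencies) and compares it with simply stripping the tags. The second sample string is hypothetical.

```r
# Hypothetical samples: the second answer contains an apostrophe
x <- c("<b><span>Answer 1</span></b>", "<b>Don't know</b>")

# Same idea as the str_extract() pattern: text between ">" and "<"
pattern <- "(?<=>)[\\w\\s]+(?=<)"

# regexpr() returns -1 for elements with no match; regmatches() returns
# only the matched pieces, so fill a vector of NAs by position
hits <- regexpr(pattern, x, perl = TRUE)
extracted <- rep(NA_character_, length(x))
extracted[hits != -1] <- regmatches(x, hits)

# Removing the tags instead keeps whatever text sits between them
stripped <- gsub("<[^>]*>", "", x)

print(extracted)  # "Answer 1" NA
print(stripped)   # "Answer 1" "Don't know"
```

If the answers are known to contain only letters, digits, and spaces, extraction works fine. Otherwise, removing the tags is the more forgiving option.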
acylam