I have a dataset in which the columns are survey questions and the values in the rows contain the answer selected by the respondent as well as multiple HTML tags. I'm trying to remove all the HTML tags so that I'm left with just the answer text.
In Excel, this could be accomplished by replacing <*> with an empty string. I can't figure out how to do this in R, though, because I can't get the wildcard to stop at the first closing angle bracket. Instead, it treats that bracket as part of the wildcard match and continues on to the end of the string. I've included a toy dataset and my attempt below.
temp <- data.frame(one = c('<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">Answer 1</span></b>',
'<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">Answer 2</span></b>',
'<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">Answer 3</span></b>'),
two = c('<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">apples are red</span></b>',
'<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">apples are blue</span></b>',
'<b style="font-weight: normal;"><span style="font-size: 12pt; font-family: "Times New Roman";white-space: pre-wrap;">apples are bananas</span></b>'))
temp[] <- sapply(temp, function(x) gsub('<.*>+', "", x))
# what I want the new temp to look like (the above code results in empty strings)
data.frame(one = c("Answer 1",
"Answer 2",
"Answer 3"),
two = c("apples are red",
"apples are blue",
"apples are bananas"))
I tried using the code for matching the nth occurrence and a few other approaches, but the match still continues past the first instance to the end of the string.
What's the regex syntax I'm missing to make the match terminate after the first instance? Also, I assume gsub() will move on to the next row after completing that first removal, thus forcing me to run it n times, where n is the max number of tags in any given column. That's not particularly problematic, but is there a workaround for that?
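For reference, here is a minimal sketch of the difference between a greedy and a non-greedy match, which appears to be the crux of the problem. It assumes that no tag contains a literal > inside an attribute value. The vector x and the helper strip_tags() are hypothetical names used only for illustration.

```r
# Hypothetical toy values resembling the question's data (the last has no tags)
x <- c('<b style="font-weight: normal;"><span style="font-size: 12pt;">Answer 1</span></b>',
       '<b><span>apples are red</span></b>',
       'no tags here')

# Greedy: '.*' runs from the first '<' to the LAST '>' and swallows the answer text
greedy <- gsub("<.*>", "", x)

# Negated character class: '[^>]*' cannot cross a '>', so each tag is matched on its own
stripped <- gsub("<[^>]*>", "", x)

# Lazy quantifier alternative: '.*?' stops at the nearest '>'
lazy <- gsub("<.*?>", "", x, perl = TRUE)

# Hypothetical helper: apply the same replacement to every column,
# keeping the data frame structure via df[] <- lapply(...)
strip_tags <- function(df) {
  df[] <- lapply(df, function(col) gsub("<[^>]*>", "", col))
  df
}
```

Because gsub() replaces every match within each element (unlike sub(), which replaces only the first), a single call should be enough regardless of how many tags a value contains. For messier real-world HTML, a proper HTML parser is likely more robust than a regular expression.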