I have a data frame of Twitter tweets with a text column that contains various Unicode codes such as <U+FFFD>. They are not only at the beginning or end of the text, but scattered throughout it. I want to remove all of these Unicode codes from the text
column while preserving the data frame. For instance, if one observation is: text text <U+FFH5> text text <U+301F> text
I would like it to return: text text text text text
I have attempted:
twitter <- str_replace_all(twitter,"<U+[[:alnum:]]>","")
twitter <- gsub("\\s*<U\\+\\w+>$","",twitter)
As well as:
twitter$text <- str_replace_all(twitter$text,"<U+[[:alnum:]]>","")
twitter$text <- gsub("\\s*<U\\+\\w+>$","",twitter$text)
They do not preserve the data frame.
My data frame currently looks like:
id text
AA Some text<U+FFFD>with some <U+671F> done
HH <U+3010><U+5B9A><U+671F>good news
AA Something<U+FFFD><U+FFFD>and so on
BB Nothing at <U+3011>
AA more<U+30C8>example
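To make this reproducible, here is a minimal sketch that rebuilds the sample above. It assumes the <U+XXXX> codes are literal characters in the strings (which is how they print for me), and that both columns are plain character vectors; the real data may differ.

```r
# Sketch of the sample data shown above.
# Assumption: the <U+XXXX> codes are literal text inside the strings,
# not actual Unicode characters that R merely prints this way.
twitter <- data.frame(
  id = c("AA", "HH", "AA", "BB", "AA"),
  text = c(
    "Some text<U+FFFD>with some <U+671F> done",
    "<U+3010><U+5B9A><U+671F>good news",
    "Something<U+FFFD><U+FFFD>and so on",
    "Nothing at <U+3011>",
    "more<U+30C8>example"
  ),
  stringsAsFactors = FALSE  # keep text as character, not factor
)

# Inspect the structure: 5 observations of 2 character variables
str(twitter)
```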
Which I would like to convert to:
id text
AA Some text with some done
HH good news
AA Something and so on
BB Nothing at
AA more example
Thanks in advance for any help.