I have a data frame of tweets whose text column contains various Unicode escapes scattered throughout the text, i.e. not only at the beginning or end, but at random positions. I want to remove only these Unicode escapes from the text column and keep the data frame intact. For instance, if one observation is:

text text <U+FFH5> text text <U+301F> text

I would like it to return:

text text text text text

I have attempted:

twitter <- str_replace_all(twitter,"<U+[[:alnum:]]>","") 

twitter <- gsub("\\s*<U\\+\\w+>$","",twitter)

As well as:

twitter$text <- str_replace_all(twitter$text,"<U+[[:alnum:]]>","") 

twitter$text <- gsub("\\s*<U\\+\\w+>$","",twitter$text)

They do not preserve the data frame.

My data frame currently looks like:

id    text
AA    Some text<U+FFFD>with some <U+671F> done
HH    <U+3010><U+5B9A><U+671F>good news
AA    Something<U+FFFD><U+FFFD>and so on
BB    Nothing at <U+3011>
AA    more<U+30C8>example

Which I would like to convert to:

id    text
AA    Some text with some  done
HH    good news
AA    Something and so on
BB    Nothing at
AA    more example

Thanks in advance for any help.

Mike G
  • Please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Are you sure the string "" is literally in the text? Or are you using a viewer that is escaping the non-ASCII character to make it printable? – MrFlick Oct 16 '17 at 13:15
  • Which client are you using? That's almost certainly ONE Unicode character that doesn't need replacing. The only reason you see it this way is that the client can't display non-ASCII characters properly, or the settings specify to display the encoded value instead of the character itself. For example, RStudio needs to be configured to use UTF-8 for reading/writing *and* displaying. – Panagiotis Kanavos Oct 16 '17 at 13:29
  • Another warning - `U+FFFD` corresponds to the Unicode replacement character. It appears when the code tries to load text that's stored in one codepage using another, incompatible one. That data is *lost*. Where does this text come from? How is it generated, saved, and read? – Panagiotis Kanavos Oct 16 '17 at 13:33
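
One way to check what the comments above are asking, sketched here under the assumption that the strings are UTF-8: test whether the characters `<U+` are literally stored in the text, or whether R is only displaying a single non-ASCII character in escaped form. The vector `x` is made-up sample data, not the asker's tweets.

```r
# Made-up sample: the first string contains the literal text "<U+FFFD>",
# the second contains the single Unicode character U+FFFD
x <- c("Some text<U+FFFD>with", "Some text\uFFFDwith")

# TRUE only where the three characters "<U+" are really part of the string
literal_escape <- grepl("<U+", x, fixed = TRUE)

# A literal escape adds 8 characters to the count, a real character only 1
n_chars <- nchar(x, type = "chars")

# If the characters are real, dropping everything that is not ASCII is one
# option; sub = "" discards the bytes that cannot be converted
ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
```

Here `literal_escape` is `TRUE` for the first element only. As the comments point out, `U+FFFD` marks data that was already lost on import, so fixing the encoding when the file is read may be preferable to stripping characters afterwards.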

1 Answer

Perhaps something like this (partly based on Remove all text between two brackets):

library(stringr)

twitter <- "text <> text <U+FFH5> text text <U+301F> text"

# Removes only the <U+...> escapes; the empty "<>" is left untouched
str_replace_all(twitter, "\\<U[^\\>]*\\>", "")
timfaber
  • I just attempted to apply this to my data frame `twitter` as: `twitter <- str_replace_all(twitter,"\\<U[^\\>]*\\>","")` and it reduced my data frame into a `Large character (2 elements, 3.5 Mb)` – Mike G Oct 16 '17 at 13:27
  • As someone mentioned in the comments, are you **sure** this character sequence exists? That it isn't how R presents a *single* Unicode character? – Panagiotis Kanavos Oct 16 '17 at 13:27
  • Not sure how you are trying to replace the values. Are you changing `twitter$text = str_replace_all(twitter$text,"\\<U[^\\>]*\\>","")`? – timfaber Oct 16 '17 at 13:36
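
The comments above suggest the data frame collapsed because the whole data frame, rather than its text column, was passed to `str_replace_all`. A minimal sketch of the column-wise version follows, using base R `gsub` so no package is needed. The sample data frame is rebuilt from the question and assumes the `<U+XXXX>` sequences are literally present in the strings. Replacing each escape with a space and then squeezing the whitespace is one possible way to get spacing close to the desired output, not something the thread confirms.

```r
# Sample data rebuilt from the question (assumes the escapes are literal text)
twitter <- data.frame(
  id = c("AA", "HH", "AA", "BB", "AA"),
  text = c(
    "Some text<U+FFFD>with some <U+671F> done",
    "<U+3010><U+5B9A><U+671F>good news",
    "Something<U+FFFD><U+FFFD>and so on",
    "Nothing at <U+3011>",
    "more<U+30C8>example"
  ),
  stringsAsFactors = FALSE
)

# Replace every literal <U+...> escape with a space, in the text column only.
# Compared with the question's pattern, the "+" is escaped and [[:alnum:]]
# gets a "+" quantifier so it matches all the code point digits.
cleaned <- gsub("<U\\+[[:alnum:]]+>", " ", twitter$text)

# Collapse the runs of whitespace left behind, then trim both ends
cleaned <- trimws(gsub("\\s+", " ", cleaned))

# Assign back to the column so `twitter` stays a data frame
twitter$text <- cleaned
```

Because only the `text` column is reassigned, `is.data.frame(twitter)` is still `TRUE` afterwards and the `id` column is unchanged.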