
I have converted a .doc document to .txt, and I have some weird formatting that I cannot remove (from looking at other posts, I think it is a hex code, but I'm not sure).

My data set is a data frame with two columns, one identifying a speaker and the second column identifying the comments. Some strings now have weird characters. For instance, one string originally said (minus the quotes):

"Why don't we start with a basic overview?"

But when I read it in R after converting it to a .txt, it now reads:

"Why don<92>t we start with a basic overview?"

I've tried:

df$comments <- gsub("<92>", "", df$comments)

However, this doesn't change anything. Furthermore, whenever I do any other substitution within a cell (for instance, changing "start" to "begin"), it turns that special character into a series of weird ? characters surrounded by boxes.
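
For example, a substitution as simple as this one is enough to trigger it:

df$comments <- gsub("start", "begin", df$comments)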

Any help would be very appreciated!

EDIT: I read my text in like this:

library(readr)
df <- read_delim("file.txt", "\n", escape_double = F, col_names = F, trim_ws = T)

It has 2 columns; the first is speaker and the second is comments.

1 Answer


I found the answer here: R remove special characters from data frame

This code worked: gsub("[^0-9A-Za-z///' ]", "", a)
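
Applied to the data frame from the question (assuming the comments column is df$comments, as in the original post), that would be:

# strip every character that is not a digit, a letter, a forward slash, an apostrophe, or a space
df$comments <- gsub("[^0-9A-Za-z///' ]", "", df$comments)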

  • Glad you found a solution. This code will remove special characters such as the ' in don't, and they will be lost from the data; you would be better off trying to change the encoding in R if that is an issue. – Chris May 31 '18 at 19:30
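
A minimal sketch of what changing the encoding could look like, assuming the original .doc export is Windows-1252 (where byte 0x92 is the curly right quote); the file name and column name follow the question:

library(readr)

# tell readr the source encoding when reading, so <92> comes in as a proper right single quote
df <- read_delim("file.txt", "\n", escape_double = FALSE, col_names = FALSE,
                 trim_ws = TRUE, locale = locale(encoding = "windows-1252"))

# or, if the data frame is already loaded, re-encode the affected column
df$comments <- iconv(df$comments, from = "windows-1252", to = "UTF-8")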