
I have converted a .doc document to .txt, and I have some weird formatting that I cannot remove (from looking at other posts, I think it is a hex code, but I'm not sure).

My data set is a data frame with two columns, one identifying a speaker and the second column identifying the comments. Some strings now have weird characters. For instance, one string originally said (minus the quotes):

"Why don't we start with a basic overview?"

But when I read it in R after converting it to a .txt, it now reads:

"Why don<92>t we start with a basic overview?"

I've tried:

df$comments <- gsub("<92>", "", df$comments)

However, this doesn't change anything. Furthermore, whenever I do any other substitution within a cell (for instance, changing "start" to "begin"), it turns that special character into a series of weird ? characters surrounded by boxes.
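
For example, a substitution as simple as this one is enough to trigger it:

df$comments <- gsub("start", "begin", df$comments)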

Any help would be very appreciated!

EDIT: I read my text in like this:

library(readr)
df <- read_delim("file.txt", "\n", escape_double = F, col_names = F, trim_ws = T)

It has 2 columns; the first is speaker and the second is comments.

1 Answer


I found the answer here: R remove special characters from data frame

This code worked: gsub("[^0-9A-Za-z///' ]", "", a)
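
Applied to the data frame from the question (assuming the comments column is df$comments, as in the original post), that would be:

# strip every character that is not a digit, a letter, a forward slash, an apostrophe, or a space
df$comments <- gsub("[^0-9A-Za-z///' ]", "", df$comments)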

  • Glad you found a solution. This code will remove special characters such as the ' in don't, and they will be lost from the data; you would be better off trying to change the encoding in R if that is an issue. – Chris May 31 '18 at 19:30
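
A minimal sketch of what changing the encoding could look like, assuming the original .doc export is Windows-1252 (where byte 0x92 is the curly right quote); the file name and column name follow the question:

library(readr)

# tell readr the source encoding when reading, so <92> comes in as a proper right single quote
df <- read_delim("file.txt", "\n", escape_double = FALSE, col_names = FALSE,
                 trim_ws = TRUE, locale = locale(encoding = "windows-1252"))

# or, if the data frame is already loaded, re-encode the affected column
df$comments <- iconv(df$comments, from = "windows-1252", to = "UTF-8")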