Sorry if this is a stupid question, but I tried searching for similar problems and did not find what I was looking for.

I scraped some text from the Internet and am now trying to work with it in R. I ran into a problem: unknown characters are inserted in the middle of some words. The text looks normal when I display the table, but when I copy it, the symbol appears. For example, if the cell in the table is "Example", this is what I see when I copy it to the console:

[image: console output showing the copied text]

Unfortunately, this is a problem because R does not recognize the word in these cases, so it would not find the cell if, for example, I searched for all cells containing the word "Example". Since the error seems random and is not limited to specific words, I do not know how to fix it. Can anybody help me?

Thank you very much in advance!!

Artem
Matt
  • Welcome to Stack Overflow! You may want to check out [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). In particular, can you show the code you used to scrape the text, and what you've tried to deal with your problem? Without some code to clue in people who are trying to help, it makes it harder for us to help you. – duckmayr Sep 29 '18 at 12:26
  • As an offhand guess, does `gsub("\U00B7", "", x)` help at all (where `x` is the variable with your problematic text)? – duckmayr Sep 29 '18 at 12:29
  • Assuming "from the Internet" means an HTML page (from the World-Wide Web), all characters are [Unicode](http://www.unicode.org/charts/nameslist/index.html). So, now no characters are "unknown" to you. Please [edit] your question to include the character in question. – Tom Blodget Sep 29 '18 at 19:30

1 Answer

You can use the `iconv` function to remove all non-ASCII characters from a string. See the example below:

    iconv("Ex·ample", from = "UTF-8", to = "ASCII", sub = "")
    # [1] "Example"
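
Note that converting to ASCII also drops legitimate non-ASCII characters such as accented letters. If that matters for your data, it may be safer to find out which character is being inserted and remove only that one. Here is a minimal sketch, assuming the stray character is the middle dot (U+00B7) guessed in the comments; the question does not confirm which character it actually is, and the vector `x` is made-up sample data:

```r
# Hypothetical sample data: the middle dot (U+00B7) stands in for the
# unknown character, which the question does not identify.
x <- c("Ex\u00B7ample", "Example", "caf\u00E9")

# List the non-ASCII code points found in each string, so the
# offending character can be identified before removing anything.
non_ascii <- lapply(x, function(s) {
  cp <- utf8ToInt(s)
  sprintf("U+%04X", cp[cp > 127])
})
print(non_ascii)

# Remove only the assumed offending character; other non-ASCII
# characters (here the e-acute in the third element) are left intact.
cleaned <- gsub("\u00B7", "", x, fixed = TRUE)

# Searching for the word now also matches the repaired element.
print(grepl("Example", cleaned, fixed = TRUE))
```

If the code point listing shows a different character in your data, replace `"\u00B7"` in the `gsub` call with that code point.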
Artem