I'm importing text from PDF files using pdf_text(). The import picks up some Unicode characters, but I can see them only with str(), not with print().

For example, print(x) displays:

"CTO area performance..."

str(x) displays:

"<(u)+F0B7> CTO area performance..." 

(note (u)+F0B7 is really U+F0B7 above)

How can I match the Unicode character shown as "<U+F0B7>" using gsub()? Since it does not seem to be in the text as literal characters, I'm having trouble replacing it with a dash "-". I tried x <- gsub("<U\\+[0-9A-Z]{4}>", "-", x), but it didn't work.

J-Besna
  • Try the solutions here: https://stackoverflow.com/questions/38828620/how-to-remove-strange-characters-using-gsub-in-r/50398057#50398057 – acylam Dec 05 '18 at 16:26
  • If you plan to remove any non-word chars from the beginning, you may also use `gsub("^\\W+", "", x)`. BTW, `U+FB07` is an unassigned character in Unicode. – Wiktor Stribiżew Dec 05 '18 at 16:27
  • If this works (it's an `iconv` solution) let us know so we can mark this as a dup https://stackoverflow.com/questions/24807147/removing-unicode-symbols-from-column-names – hrbrmstr Dec 05 '18 at 16:28
  • To clarify, U+FB07 is a bullet from a bulleted list. I'm looking to replace it with a dash ("-") rather than eliminate it. It's also not just at the beginning, U+FB07 appears throughout the document. – J-Besna Dec 05 '18 at 16:39
  • Try a mere `sub("\\x{FB07}", "=>", x)`, you may even add `fixed=TRUE` argument (no need of regex here). – Wiktor Stribiżew Dec 05 '18 at 16:48
  • Thanks all. No luck yet. gsub("^\\W+", "", x) removed it from the beginning, but not subsequent instances. To clarify, I am not looking to remove it, but rather replace the unicode with a dash "-". Here's the output from the gsub - we can see the second U+F0B7 remains: "CTO area performance and Operational Stability remain strong\r\n " – J-Besna Dec 05 '18 at 18:10
  • So, does `gsub("\\x{FB07}", "-", x)` solve the issue? – Wiktor Stribiżew Dec 05 '18 at 18:26
  • Yes! gsub("\\x{FB07}", "-", x) did the trick. Many thanks. – J-Besna Dec 05 '18 at 18:37
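For later readers, here is a minimal sketch of the approach the comments converge on. It assumes the bullet is the single private-use character U+F0B7 that str() reports in the question (the comments write FB07, so substitute whichever code point your own data shows). The pattern `<U\\+[0-9A-Z]{4}>` from the question likely fails because `<U+F0B7>` is only how R displays the character; the string itself holds one character, not that literal text. The sample string `x` below is hypothetical and merely stands in for one element of the pdf_text() result.

```r
# Hypothetical sample standing in for one page returned by pdf_text();
# "\uf0b7" is the code point U+F0B7 that str() displays in the question.
x <- "\uf0b7 CTO area performance\r\n\uf0b7 Operational Stability remain strong\r\n"

# List the non-ASCII code points actually present, to confirm what to match.
# utf8ToInt() takes a single string, so apply it to one page at a time.
cps <- utf8ToInt(x)
found <- unique(sprintf("U+%04X", cps[cps > 127]))

# fixed = TRUE treats the pattern as a literal string, so every occurrence
# is replaced and no regex escaping is needed.
x_clean <- gsub("\uf0b7", "-", x, fixed = TRUE)
```

Because gsub() is vectorized, the same call can be applied to the whole character vector that pdf_text() returns, not just a single page.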

0 Answers