0

So, I've seen some similar questions, but the solutions there don't appear to work, so I assume I've either done something wrong or I don't have exactly the same problem.

I'm using dbReadTable to download data from an SQL server for analysis. The data has a variable which can have currency symbols in a character string, e.g. £. The variable can have one of 9 possible values with the £ symbol appearing 0, 1 or 2 times in various places in the string (I'm not able to share the options here due to data sensitivity) but one example could be...

"[text]£[text]£[text]"

On the SQL db, these display correctly and have class varchar. Once downloaded as a data.frame, the variable has class character. However, the £ symbols now either displaying as "< U+00A3>" (when Viewing the data.frame) or � (when checking the value for an observation). I've tried using gsub and stri_replace_all_fixed to replace the "< U+00A3>" values in the variable, but it doesn't seem to find anything to replace. Below is some code showing what I've done.

con1 <- dbConnect(odbc(), dsn = dsn1, Database = dbase1)
cube1 <- dbReadTable(con1, table1)

cube1$variable <- stri_replace_all_fixed(cube1$variable, "<U+00A3>", "")
cube1$variable <- gsub("^.+<U\\+\\w+>.+$", "", cube1$variable)

Neither of those last 2 lines did anything and I'm still stuck with the "< U+00A3>" or � values (depending on how you look at the values for the variable). As a test, I did check to see if £ symbols display when entered in the Console, and they do...

> x = c("£2")
> x
[1] "£2"

I'm looking to either get the £ symbol to display correctly or to remove it from the variable (or solutions for both options).

Micha Wiedenmann
  • 19,979
  • 21
  • 92
  • 137
Shanagar
  • 77
  • 4
  • It would be helpful to see something from `cube1$variable` through `dput`, at least the relevant parts (due to data sensitivity). – Julius Vainora Oct 23 '18 at 13:57
  • Looks like an encoding issue. Note that `"\u00a3" == "£"` – James Oct 23 '18 at 14:15
  • Unfortunately, our data protection rules prevents me giving any examples of the values for this variable. We take data protection very seriously. If I use `dput`, the variable has the '£' unicode characters < U+00A3> (note: I keep putting spaces before the U because if I don't the entire string disappears). – Shanagar Oct 24 '18 at 09:22
  • So, looking at the encoding for the variable, the values are either "unknown" or "UTF-8" - the former are where the values don't contain "£" symbols and the latter are where they do. Even though they are encoded UTF-8, they still won't appear as £ symbols – Shanagar Oct 24 '18 at 09:43

1 Answers1

1

An easy solution is to not mess with Regex by using fixed=TRUE in your gsub call

test_string <- "< U+00A3>25"

gsub('< U+00A3>','£',test_string,fixed=TRUE)

[1] "£25"
Randall Helms
  • 849
  • 5
  • 15
  • Thanks for the suggestion, but this hasn't worked. I get the following response: – Shanagar Oct 24 '18 at 09:24
  • If you run the code above, does it not work at all for you, or does it only not work when applied to your data from the database? I.e. can you reproduce my example? – Randall Helms Oct 24 '18 at 09:33
  • It might be a display issue on your end actually, that it is still encoded as £ but it just doesn't display properly in your R environment. Try running `gsub('£','GBP ', cube1$variable)` – Randall Helms Oct 24 '18 at 09:38
  • `Error in gsub("", "£", cube1r$variable, fixed = T) : input string 1 is invalid UTF-8`. I've been trying to understand the error, but so far, I don't see why that string would cause a problem – Shanagar Oct 24 '18 at 09:40
  • 1
    This sounds like an import problem, you might need to adjust your data set earlier in the process to alter non UTF-8 characters. Here's a previous question/answer on that topic: https://stackoverflow.com/questions/17291287/how-to-identify-delete-non-utf-8-characters-in-r – Randall Helms Oct 24 '18 at 09:45
  • Excellent, that post worked. Using `iconv(cube1r$variable, "UTF-8", "UTF-8", sub = "")` successfully removed the problem characters. Interestingly, if I sub'd in "£", it still failed to display, but I'm happy with removing them for now and I can see about adding them back in later after. Thanks Randall! – Shanagar Oct 24 '18 at 11:51