0

So, I've been struggling with this for a while now and can't seem to google my way out of it. I'm trying to read a .sql file into R, I always do that to avoid putting 100+ lines of sql in my R scripts. I usually do this:

library(tidyverse)
library(DBI)

con <- dbConnect(<CONNECTION ARGUMENTS>)

query <- read_file("path/to/script.sql")
df <- as_tibble(dbGetQuery(con, query))

dbDisconnect(con)

However, this time my sql script has some Spanish characters in it. Say something like this:

select tree_id, tree
from forest.trees
where species = 'árbol'

When I read this script into R and make the query it just doesn't return anything, but if I copy and paste the sql script into an R string it works! So it seems that the problem is in the line where I read the script into R.

I tried changing the string's encoding in a couple of ways:

# none of these work
query <- read_file("path/to/script.sql")
Encoding(query) <- "latin1"

query <- readLines("path/to/script.sql", encoding = "latin1")
query <- paste0(query, collapse = " ")

Unfortunately I don't have a public database to offer to anyone reading this. I'm connecting to a postgreSQL 11 database.

--- UPDATE ----

I'm on a windows 10 machine, with US locale.

When I use the read_file function the contents of query look ok, the Spanish characters print out like they should, but when I pass it to dbGetQuery it just doesn't fetch anything.

I tried forcing encoding "latin1" because I found online that Spanish characters tend to fix in R when doing that. When doing this, the Spanish characters print out wrong, so I didn't expected it to work, and it didn't.

The character values in my database have 'utf-8' encoding.

Just to be completely clear, all my attempts to read the .sql script haven't worked, however this does work:

library(tidyverse)
library(DBI)

con <- dbConnect(<CONNECTION ARGUMENTS>)

query <- "select tree_id, tree from forest.trees where species = 'árbol'"
# df actually has results
df <- as_tibble(dbGetQuery(con, query))

dbDisconnect(con)
Kean Rawr
  • 33
  • 1
  • 6
  • What encoding did you use on that file that contains non-ASCII characters? Are you sure that it's latin-1? Are you on Windows? Does the contents of `query` look correct? How are character values encoded in your data base? – MrFlick Apr 13 '20 at 23:54
  • @MrFlick I updated the question with all the information, thanks for the help! – Kean Rawr Apr 14 '20 at 01:02
  • Well, if you take the `query` from the chunk that works, and then do `write_lines(query, "test.sql")` and then read it back in and run it with `query <- read_lines("test.sql"); dbGetQuery(con, query)` does it work? On windows everything is latin-1 by default. You will probably need to convert to utf-8 to work with the database connection. What does `Encoding(query)` return before you try to set it "latin-1". – MrFlick Apr 14 '20 at 01:11
  • @MrFlick I actually already tried that and it didn't work, I didn't post it as something I tried because it sounded dumb when I was writing the question :P – Kean Rawr Apr 14 '20 at 01:14
  • `Encoding(query)` returns "UTF-8" – Kean Rawr Apr 14 '20 at 01:21

2 Answers2

0

The encoding statement is telling R how to interpret the filename, not its contents. Try this instead:

filetext <- readLines(file("path/to/script.sql", encoding = "latin1"))

See this answer for more details:R: can't read unicode text files even when specifying the encoding

Paul Smith
  • 454
  • 6
  • 11
  • that also didn't work, `dbGetQuery` doesn't return anything :/ The question that you're referring to only deals with reading the characters into R, which I have no problem doing using `read_file`. The problem is when I try to send that string as a query. – Kean Rawr Apr 14 '20 at 01:07
0

So after some time to think about it, I wondered why the solution proposed by MrFlick didn't work. I checked the encoding of the file created by this chunk:

query <- "select tree_id, tree from forest.trees where species = 'árbol'"
write_lines(query, "test.sql")

After checking what encoding did test.sql have, it turned out it was ANSI, but it didn't look right. So I manually changed my original script.sql encoding to ANSI. After that it worked totally fine.

This solution however, didn't work when I cloned my repo on an ubuntu environment. In ubuntu there was no problem with the original 'utf-8' encoding.

Hope this helps anyone dealing with this in windows.

Kean Rawr
  • 33
  • 1
  • 6