3

I know of course about reproducible example and piece of code but for this question I have to be (I can't be otherwise) obscure.

I am trying to connect R and Impala. Putting aside the problems ("officially", I cannot install software on this PC... but I have used portable versions of R and RStudio)

I've tried the RImpala package.

rimpala.connect(IP = myip,
              port = the port where Impala sees,
              principal = maybe this is not clear)

I am pretty sure that the causes of my problems is the principal argument, the documentation is not clear to me. Anyway, I've tried several combinations of what the documentations says it should be placed there.

In any case I get the same error:

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.IllegalArgumentException: Kerberos principal should have 3 parts: 10.60.10.22:8888/impala/@tempuser

I've searched online for this error and it seems to be related to some java things, but I have zero knowledge of that language.

It can be useful to know that I have no access to my PC, say, I cannot install any software or do a thing that only an administrator can do.

I know the question is not well written but as I've said for this time a reproducible example is impossible.

More details

Now that I am thinking about it, I filled the IP argument with the address I saw in the navigation bar of my browser to connect to Hue. I guessed it was the same but maybe I did wrong on this point too anyway as I've said I am pretty sure the error is not due to that.

Sotos
  • 51,121
  • 6
  • 32
  • 66
SabDeM
  • 7,050
  • 2
  • 25
  • 38
  • AFAIK Hue bypasses Kerberos user authentification -- i.e. *you* authenticate against Hue using a Hue login/password, then *Hue* authenticates against Impala using its own Kerberos principal/keytab. So you don't know what connection string to use to access Impala directly, you don't know which login to use against Impala, and you don't know which authentication mechanism to use (password ? Kerberos ticket ?) – Samson Scharfrichter Nov 06 '15 at 15:44
  • By the way, the RImpala docs are quite clear: "IP" and "port" must point to the Impala service, somewher. And "principal" refers to the Kerberos principal of the Impala service -- required if the whole Hadoop cluster is kerberized *(then you must have a Kerberos client on your PC, plus a user/password)*, ignored otherwise – Samson Scharfrichter Nov 06 '15 at 15:49
  • @SamsonScharfrichter thank you for your comment. As IP and port I've used the one I saw in the browser when I connect to Hue, say 10.10.10.20:8888 where the first piece is IP and 8888 is port. I still have no idea if that is correct and if I have (and how) to fill the `principal` argument. – SabDeM Nov 07 '15 at 12:10
  • 1
    The Cloudera documentation for Hue, section *Impala query UI*, states... "you can view the current configuration from the **_Settings_** tab". You should have have a look at that tab. – Samson Scharfrichter Nov 07 '15 at 22:25
  • If your Impala was installed with default values, the port of the Impala is 21050. However, you need to figure out the IP where the Impala was installed. – USchneider Dec 02 '15 at 15:10

5 Answers5

6

Instead of using RImpala package, how about using RJDBC to connect. You can download latest impala JDBC driver jar file from cloudera website: http://www.cloudera.com/downloads/connectors/impala/jdbc/2-5-5.html

then import these files in the jar to R and using them to connect.

install.packages("rJava")

install.packages("DBI")

install.packages("RJDBC")

library(DBI)

library(rJava)

library(RJDBC)

cp <- c(        
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/commons-codec-1.3.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/commons-logging-1.1.1.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/hive_metastore.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/hive_service.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/httpclient-4.1.3.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/httpcore-4.1.3.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/libfb303-0.9.0.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/libthrift-0.9.0.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/log4j-1.2.14.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/ql.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/slf4j-api-1.5.11.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/slf4j-log4j12-1.5.11.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/TCLIServiceClient.jar",
           "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/zookeeper-3.4.6.jar"
)

.jinit(classpath<-cp)

drv <- JDBC("com.cloudera.impala.jdbc4.Driver", "C:/Users/Cloudera_ImpalaJDBC4_2.5.31/ImpalaJDBC4.jar")

con <- dbConnect(drv, "jdbc:impala://your_impala_host_address:21050;AuthMech= your authmech number if applicable", "username", "pwd")

data <- dbGetQuery(con, "SELECT * FROM mydb limit 25")

summary(data)
user2100721
  • 3,557
  • 2
  • 20
  • 29
stellar
  • 61
  • 1
  • 6
5

The R package implyr (on CRAN and GitHub) provides a dplyr backend for Impala, using either the ODBC or JDBC driver to connect. See the README for instructions.

ianmcook
  • 537
  • 4
  • 10
5

I've had success using the ODBC connector and the odbc package in R. This method doesn't appear to have any Java dependencies and is recommended by the author of the implyr package. From my limited experience, this connector does a better job of correctly matching R data types to Impala data types, resulting in smaller object sizes within R.

For Macs, the process goes something like,

  • Install the Cloudera ODBC connector
  • Install unixodbc: brew install unixodbc
  • Follow the Cloudera ODBC connector installation guide

    • echo export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:/opt/cloudera/impalaodbc/lib/universal >> ~/.bash_profile
    • create a ~/.odbcinst.ini file with

      [ODBC Drivers] 
      Cloudera ODBC Driver for Impala=Installed
      [Cloudera ODBC Driver for Impala]
      Driver=/opt/cloudera/impalaodbc/lib/universal/libclouderaimpalaodbc.dylib
      Description=Cloudera ODBC Driver for Impala
      
    • optionally, create a ~/.odbc.ini file with your connection details. Here, I'm using Kerberos:

      [impala]
      Driver = Cloudera ODBC Driver for Impala
      Database = 
      Host =
      Port =
      KrbHostFQDN =
      KrbServiceName =
      KrbRealm =
      AuthMech = 1
      
    • source ~/.bash_profile to ensure that DYLD_LIBRARY_PATH is updated
  • in R, ensure you have DBI and odbc installed: install.packages(c("DBI", "odbc"))
  • Finally, to make a connection in R,

    library(DBI)
    library(odbc)
    conn <- dbConnect(odbc::odbc(),
                   driver = "Cloudera ODBC Driver for Impala",
                   #database = "",
                   host = "",
                   port = ,
                   KrbHostFQDN = "",
                   KrbServiceName = "",
                   KrbRealm = "",
                   AuthMech=1)
    
  • Then, to retrieve something,

    dd <- dbGetQuery(conn, "select * from my_awesome_db.my_awesome_table limit 10;")
    
matmat
  • 875
  • 8
  • 11
1

Just wanted to put in another way to access classpaths rather than writing all the jars:

drv <- JDBC(driverClass = "com.cloudera.impala.jdbc3.Driver", 
        classPath = list.files("C:/Users/Impala",
                               pattern="jar$",full.names=T),
        identifier.quote="'")
Jason Born
  • 39
  • 7
1

Use the RODBC package. I successfully use it in production. Here, I write down a tutorial for it. From this Blog

  1. Download ClouderaImpalaODBC32.msi and install it.
  2. Open it and type the required information you need in the argument fields, here are some screenshots to helps you do it more quickly.
  3. In the R environment, install and library RODBC package.
  4. type,
library(RODBC)
impala <- odbcConnect("Impala")
sqlQuery(impala,"select * from xxx")

By the way, if your environment is under Win 10, in the function odbcConnect, you have to give your username and password (Just from my colleague report).

I hope you successfully use impala by R.

Jiaxiang
  • 865
  • 12
  • 23