
How can I run normal R code on a SQL Server without using the Microsoft rx functions? I think the compute context "RxInSqlServer" isn't the right one, but I couldn't find good information about the other compute context options.

Is this possible with this Statement?

rxSetComputeContext(ComputeContext)

Or can I only use it to run rx functions? Another option could be to set the server connection in RStudio or Visual Studio?

My problem is: I want to analyse data from Hadoop via an ODBC connection on the SQL Server, so I would like to use the performance of the remote SQL Server and not the data in SQL Server. And then I want to analyse the Hadoop data with sparklyr.

Summary: I want to use the performance of the remote server, not the SQL Server data. So RStudio should not run locally; the work should run on, and use the memory of, the remote server.

Thanks!

user43348044
  • I'm sorry but I don't understand (1) where your data you want to analyze is and (2) what machine you want the R processes to run on. – Bob Albright Apr 21 '17 at 13:55
  • I want to execute the analysis via RStudio remotely on a SQL Server (with R) to use the performance (disk space, memory, ...) of the remote server. But I don't want to use the SQL Server database! I connect via an ODBC connection on the SQL Server to Hadoop. The advantage is the memory of the remote server, and I think the execution could run in parallel on a remote server with Microsoft R, right? On my local machine this isn't possible. – user43348044 Apr 21 '17 at 14:58
  • You should be able to execute on the remote SQL context as you describe. You'll basically have to make an ODBC connection using SQL authentication to the Hadoop database (PolyBase?). For this to work you'll want to open "Windows Firewall with Advanced Security" on the SQL Server database. In "Outbound rules" you'll need to disable "Block network access for R local user accounts in SQL Server instance MSSQLSERVER". – Bob Albright Apr 21 '17 at 15:29
  • That sounds good! But what does the rxSetComputeContext call look like to connect to the SQL Server and execute R remotely? I already have a connection from the SQL Server to the Hadoop database which works fine. I mean this function: https://msdn.microsoft.com/en-us/microsoft-r/scaler/packagehelp/rxsetcomputecontext – user43348044 Apr 21 '17 at 15:38
  • I'd *not* make any calls to rxSetComputeContext (basically use the "local" compute context when running inside SQL Server). That will pull data from the current SQL box over ODBC (I think?) into R. – Bob Albright Apr 21 '17 at 15:59
  • When I use "local", R executes on my local machine. But I want to execute on the remote machine (SQL Server). And the ODBC connection is on the server. The remote machine connects to Hadoop and stores the data in XDF files on the disk of the remote server, so that all execution happens on the remote machine and only the results come back to my local machine. – user43348044 Apr 21 '17 at 16:30
  • I think you want to call rxExec and dump code in there which will get you a context that's inside SQL Server. This isn't a scenario I've used much myself. You might be better off in general reading up on operationalization (https://msdn.microsoft.com/en-us/microsoft-r/operationalize/remote-execution) it would let you get a "remote shell" but you'll need to install / configure it separate from SQL Server. – Bob Albright Apr 21 '17 at 16:38
  • That sounds like what I need. I will try it and let you know. Thanks for your help! – user43348044 Apr 21 '17 at 17:28
  • One last question: In all the scripts I read about "RxInSqlServer". What does this do exactly? I think it executes the code on the remote server too, right? And then sends the results to the local machine. – user43348044 Apr 21 '17 at 22:48
  • That's a reasonable way to think about it. I'm not sure how MS suggests thinking about it but I think of it as moving computation for *rx* functions (BxlServer) to SQL Server. – Bob Albright Apr 22 '17 at 12:20

1 Answer


The concept of a compute context in Microsoft R Server is, “Where will the computation be performed?”

When setting compute context, you are telling Microsoft R Server that computation will occur on either the local machine (with either “local” or “localpar” compute contexts), or, the script will be executed on a remote machine which has Microsoft R Server installed on it. Remote compute contexts are defined by creating a compute context object, and then setting the context to that object.

For SQL Server, you would create an RxInSqlServer() object, and then call rxSetComputeContext() on that object. For Hadoop, the object would be created via the RxHadoopMR() call.

In code, it would look something like:

CC <- RxHadoopMR( < context defined here > )
rxSetComputeContext(CC)

To see usage on defining a context, please see the documentation (enter "?RxHadoopMR" in the R Client, no quotes).

Any call to an "rx" function after this will be performed on the Hadoop cluster, with no data being transferred to the client other than the results.

RxInSqlServer() would follow the same pattern.
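As a sketch of that pattern (the server name, database, and credentials below are placeholders you would replace with your own; `wait` and `consoleOutput` are standard RxInSqlServer parameters):

```r
# Hypothetical connection string -- adjust for your environment.
sqlConnString <- "Driver=SQL Server;Server=myServer;Database=myDB;Uid=myUser;Pwd=myPwd"

sqlCC <- RxInSqlServer(connectionString = sqlConnString,
                       wait = TRUE,          # block until the remote job completes
                       consoleOutput = TRUE) # stream remote console output back

rxSetComputeContext(sqlCC)
# From here on, rx functions execute inside the SQL Server instance.
```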

Note: To perform any remote computation, Microsoft R Server must be installed on that machine.

If you wish to run a standard R function on a remote compute context, you must wrap that function in a call to rxExec(). rxExec() is designed as an interface to parallelize any open-source R function and allow for its execution on a remote context. Please see the documentation (enter "?rxExec" in the R Client, no quotes) for usage.
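A minimal sketch of the wrapping pattern, assuming a remote compute context has already been set (the function body here is just illustrative open-source R):

```r
# A plain R function (no rx calls) that we want to run remotely.
myAnalysis <- function() {
  fit <- lm(mpg ~ wt + hp, data = mtcars)  # standard open-source R
  coef(fit)
}

# With the compute context set to a remote one (e.g. rxSetComputeContext(sqlCC)),
# rxExec ships the function to the remote machine and returns its result(s)
# to the client as a list.
results <- rxExec(myAnalysis)
```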

For information on efficient parallelization, please see this blog: https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2016/11/14/performance-optimization-when-using-rxexec-to-parallelize-algorithms/

You called out "without using the Microsoft rx-functions", and I am interpreting this as "I would like to use open-source R algorithms on data in SQL Server". With Microsoft R Server, you must use rxExec() as the interface to run open-source R. If you want to use no rx functions at all, you will need to query the data to your local machine and then use open-source R. To interface with a remote context using Microsoft R Server, the bare minimum is rxExec().

This is how you will be able to achieve the first part of your ask, "how can I perform normal R-Code on a SQL Server without using the Microsoft rx-functions? I think the ComputeContext "RxInSqlServer" isn't the right one?"


For your second ask, "My Problem is: I want analyse data from hadoop via ODBC-Connection on the SQL Server, so I would like to use the performance of the remote SQL Server and not the data in SQL Server. And then I want analyse the hadoop-data with sparklyr."

First, I'd like to comment that with the release of Microsoft R Server 9.1, you can use sparklyr in-line with an MRS Spark connection, for some examples, please see this blog: https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2017/04/19/new-features-in-9-1-microsoft-r-server-with-sparklyr-interoperability/
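The interop pattern from that blog looks roughly like the following; this is a hedged sketch, and the exact function names (rxSparkConnect with the interop argument, rxGetSparklyrConnection) come from the MRS 9.1 release described there:

```r
library(sparklyr)

# Open an MRS Spark compute context with sparklyr interoperability enabled.
cc <- rxSparkConnect(interop = "sparklyr")

# Extract the underlying sparklyr connection from the compute context;
# 'sc' can then be used with ordinary sparklyr/dplyr verbs.
sc <- rxGetSparklyrConnection(cc)

# ... sparklyr analysis against the cluster goes here ...

rxSparkDisconnect(cc)
```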

Secondly, what you are trying to do is very involved. I can think of two ways that this is possible.

One is, if you have SQL Server PolyBase, you can configure SQL Server to make a virtual table referencing data in Hadoop, similar to Hive. After you have referenced your Hadoop data in SQL Server, you would use an RxInSqlServer() compute context on these tables. This would analyse the data in SQL Server and return the results to the client.

Here is a detailed blog explaining an end-to-end setup on Cloudera and SQL Server: https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2016/10/17/integrating-polybase-with-cloudera-using-active-directory-authentication/
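Once the external table exists, the client-side script might look like this; "HadoopSales", the column names, and the connection string are all hypothetical:

```r
sqlConnString <- "Driver=SQL Server;Server=myServer;Database=myDB;Trusted_Connection=Yes"

# Run subsequent rx computations inside the SQL Server instance.
rxSetComputeContext(RxInSqlServer(connectionString = sqlConnString))

# 'HadoopSales' is a hypothetical PolyBase external table backed by Hadoop data.
salesData <- RxSqlServerData(table = "HadoopSales",
                             connectionString = sqlConnString)

# This summary runs in SQL Server; PolyBase streams the Hadoop data in,
# and only the summary results come back to the client.
rxSummary(~ amount + region, data = salesData)
```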

The second, which I would NOT recommend, is untested, hacky, and has the following prerequisites:

1) Your Hadoop cluster must have OpenSSH installed and configured
2) Your SQL Server machine must have the ability to SSH into your Hadoop cluster
3) You must be able to place an SSH key on your SQL Server machine in a directory which the R Services process can access

And I need to add another disclaimer here: there is no guarantee of this working and, likely, it will not work. The software was not designed to operate in this fashion.

You would then do the following:

  • On your client machine, you would define a custom function which contains the analysis that you wish to perform, this can be Open Source R Function, rx functions, or a mix.
  • In this custom function, before calling any other R or rx functions, you would define an RxHadoopMR compute context object which points to your cluster, referencing the SSH key in the directory on the SQL Server machine as if you were executing from that machine (in the same way that you would define the RxHadoopMR object if you were to do a remote Hadoop operation from your client machine).
  • Within this custom function, immediately after RxHadoopMR() is defined, you would call rxSetComputeContext() on your defined RxHadoopMR() object
  • Still in this custom function, write the actual script which will operate on the data in Hadoop.
  • After this function is defined, you would define an RxInSqlServer() compute context object on the client machine.
  • You would set your compute context to RxInSqlServer()
  • Then you would call rxExec() with your custom function as an input.
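The steps above can be sketched as follows; every hostname, path, and credential here is hypothetical, and (as stated) this nested-context arrangement is untested:

```r
# Custom function to be executed ON the SQL Server machine via rxExec.
remoteHadoopJob <- function() {
  # Hypothetical cluster details; the SSH key path must be readable
  # by the R Services process on the SQL Server machine.
  hadoopCC <- RxHadoopMR(sshUsername = "hdfsuser",
                         sshHostname = "hadoop-edge-node",
                         sshSwitches = "-i /var/opt/mssql/keys/id_rsa",
                         nameNode    = "hadoop-namenode",
                         port        = 8020)
  rxSetComputeContext(hadoopCC)
  # ... rx or open-source R analysis against the Hadoop data goes here ...
}

# On the client: point the compute context at SQL Server, then ship the
# function there. It will, in turn, try to reach Hadoop over SSH.
sqlCC <- RxInSqlServer(
  connectionString = "Driver=SQL Server;Server=myServer;Database=myDB;Uid=myUser;Pwd=myPwd")
rxSetComputeContext(sqlCC)

results <- rxExec(remoteHadoopJob)
```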

What this will do is execute your custom function on the SQL Server machine, which would hopefully cause it to define its compute context as your Hadoop cluster and pull the data over SSH for analysis on the SQL Server machine, returning the results to the client.

With that said, this is not how Microsoft R Server was designed to be used, and if you wish to optimize performance, please use Option One and configure PolyBase.

  • What a fantastic answer!!! Thank you very much! Your answer makes Microsoft R clearer to me. I will try it out in the next days and I will follow your blog posts, very interesting. I have one last question about the infrastructure: I frequently read "This would analyse the data in SQL Server, and return the results to the client." What exactly does that mean? Analyse "in database" in SQL Server or "in memory" of the SQL Server? And where is the data stored with the compute context RxInSqlServer? Locally the data is stored in XDF files. – user43348044 Apr 25 '17 at 21:13
  • @R123456789, Thank you very much! By "This would analyse the data in SQL Server, and return the results to the client.", I mean that SQL Server with R Services does not have to query the data out of SQL to perform the analysis; MRS has direct access to the data, will operate within the SQL Server process, and will return the results to your R Client. With the RxInSqlServer() compute context, the result data will be stored based on how you specify the "outData" parameter of your call to an rx function; this can be an XDF file, a table in SQL, a data frame, or many other sinks. – Kirill Glushko - Microsoft Apr 26 '17 at 00:47
  • Glushko: Can I use rxExec with the RxInSqlServer() compute context? For this I asked a separate question (example): http://stackoverflow.com/questions/43775573/how-to-use-rxexec-to-run-standard-r-remote – user43348044 May 05 '17 at 09:59
  • @user43348044 I have left a comment on your separate question, please reply there. – Kirill Glushko - Microsoft May 05 '17 at 21:34