I am new to the world of Java and Spark. I found SparkCLR (Mobius), an impressive library that provides C# bindings for Spark and allows us to use C# to work with Spark SQL.
I have a large amount of process data in a custom data store that has ODBC and OPC interfaces. We would like to expose this data to Apache Spark so that we can run analytical queries on it using tools like Apache Zeppelin.
Since there is no JDBC interface on my custom store, I was looking at writing C# code to pull the data from the custom data store through the available ODBC interface and hand it to Spark via historyDataFrame.RegisterTempTable("mydata");
I am able to create a sample DataFrame and query it with SQL from the C# sample, but what I cannot figure out is how this data can be made available to Spark in a way that lets me work with tools like Apache Zeppelin.
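For reference, here is roughly what I have so far on the ODBC side (a minimal sketch: the DSN, query, and column names are placeholders for my store, SparkCLRSamples.SparkContext and GetSqlContext() come from the Mobius samples, and I am assuming the StructType/StructField constructors work the way the samples suggest):

using System.Collections.Generic;
using System.Data.Odbc;
using Microsoft.Spark.CSharp.Sql;

// Pull the rows from the custom store over ODBC into driver memory
var rows = new List<object[]>();
using (var connection = new OdbcConnection("DSN=MyCustomStore")) // placeholder DSN
{
    connection.Open();
    using (var command = new OdbcCommand("SELECT tag, ts, value FROM history", connection)) // placeholder query
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // Timestamps kept as ISO-8601 strings to keep the example simple
            rows.Add(new object[] { reader.GetString(0), reader.GetDateTime(1).ToString("o"), reader.GetDouble(2) });
        }
    }
}

// Describe the columns so Spark SQL knows the types (assumed Mobius schema API)
var schema = new StructType(new List<StructField>
{
    new StructField("tag", new StringType()),
    new StructField("ts", new StringType()),
    new StructField("value", new DoubleType())
});

// Turn the rows into a DataFrame and register it for SQL queries
var rddHistory = SparkCLRSamples.SparkContext.Parallelize(rows);
var historyDataFrame = GetSqlContext().CreateDataFrame(rddHistory, schema);
historyDataFrame.RegisterTempTable("mydata");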
Also, what is the best way to load a large amount of data into Spark SQL? Doing something like the sample below probably will not work for loading over a million records, since everything is materialized in the driver before it is parallelized.
// Parallelize the in-memory rows into an RDD on the driver
var rddPeople = SparkCLRSamples.SparkContext.Parallelize(
    new List<object[]>
    {
        new object[] { "123", "Bill", 43, new object[] { "Columbus", "Ohio" }, new string[] { "Tel1", "Tel2" } },
        new object[] { "456", "Steve", 34, new object[] { "Seattle", "Washington" }, new string[] { "Tel3", "Tel4" } }
    });

// Apply the schema (schemaPeople, defined elsewhere in the sample) to get a DataFrame
var dataFramePeople = GetSqlContext().CreateDataFrame(rddPeople, schemaPeople);
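For the million-plus-record case, one idea I had is to stage the ODBC extract to a file Spark can read natively, so the executors load it in parallel instead of everything flowing through the driver via Parallelize. Something like this sketch, assuming Mobius mirrors Spark's DataFrameReader (Read().Json()), with the path and DSN as placeholders:

using System.Data.Odbc;
using System.Globalization;
using System.IO;

// Stream the ODBC result set to a JSON-lines file instead of holding it all in memory
using (var writer = new StreamWriter(@"C:\staging\history.json")) // placeholder path
using (var connection = new OdbcConnection("DSN=MyCustomStore"))  // placeholder DSN
{
    connection.Open();
    using (var command = new OdbcCommand("SELECT tag, ts, value FROM history", connection))
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // One JSON object per line; Spark's json reader infers the schema
            writer.WriteLine(string.Format(CultureInfo.InvariantCulture,
                "{{\"tag\":\"{0}\",\"ts\":\"{1:o}\",\"value\":{2}}}",
                reader.GetString(0), reader.GetDateTime(1), reader.GetDouble(2)));
        }
    }
}

// Read the staged file through Spark so the load happens on the executors
var historyDataFrame = GetSqlContext().Read().Json(@"C:\staging\history.json");
historyDataFrame.RegisterTempTable("mydata");

Is that a reasonable pattern here, or is there a better-supported path for bulk loads?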
Hoping to get some pointers here to get this working.