
I am new to the world of Java and Spark. I found an impressive library that provides C# bindings for Spark, which allows us to use C# to work with Spark SQL.

I have a large amount of process data in a custom data store that has an ODBC and an OPC interface. We would like to expose this data to Apache Spark so that we can run analytical queries on it using tools like Apache Zeppelin.

As there is no JDBC interface on my custom store, I was looking at writing C# code that pulls the data out of the custom data store over the available ODBC interface and exposes it to Spark via `historyDataFrame.RegisterTempTable("mydata")`, roughly as sketched below.
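
Roughly what I am attempting is sketched here. The DSN, the query, and `mySchema` are placeholders; `mySchema` would be a `StructType` built the same way as `schemaPeople` in the SparkCLR sample, and `SparkCLRSamples.SparkContext` / `GetSqlContext()` are the same helpers the sample uses.

    using System.Collections.Generic;
    using System.Data.Odbc;

    // Pull the rows out of the custom store over its ODBC interface.
    // "DSN=MyCustomStore" and the SELECT statement are placeholders.
    var rows = new List<object[]>();
    using (var connection = new OdbcConnection("DSN=MyCustomStore"))
    {
        connection.Open();
        using (var command = new OdbcCommand("SELECT tag, ts, value FROM history", connection))
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                var row = new object[reader.FieldCount];
                reader.GetValues(row);
                rows.Add(row);
            }
        }
    }

    // Hand the rows to Spark the same way the sample does.
    // mySchema is a placeholder StructType describing the three columns above.
    var rddHistory = SparkCLRSamples.SparkContext.Parallelize(rows);
    var historyDataFrame = GetSqlContext().CreateDataFrame(rddHistory, mySchema);
    historyDataFrame.RegisterTempTable("mydata");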

I am able to create a sample DataFrame and query it with SQL as in the C# sample, but what I am unable to understand is how this can be made available to Spark in such a way that I can work with it from tools like Apache Zeppelin.

Also, what is the best way to load a large amount of data into Spark SQL? Doing something like the sample below may not work for loading over a million records.

    var rddPeople = SparkCLRSamples.SparkContext.Parallelize(
                            new List<object[]>
                            {
                                new object[] { "123", "Bill", 43, new object[]{ "Columbus", "Ohio" }, new string[]{ "Tel1", "Tel2" } },
                                new object[] { "456", "Steve", 34,  new object[]{ "Seattle", "Washington" }, new string[]{ "Tel3", "Tel4" } }
                            });

    var dataFramePeople = GetSqlContext().CreateDataFrame(rddPeople, schemaPeople);

Hoping to get some pointers here to get this working.

zero323
Kiran

1 Answer


You could dump the data in CSV format and let Spark/SparkCLR load it for Spark SQL analysis. Loading the data from CSV files will give the same result as the parallelize in your code, but with much better performance. This approach will work for you if the data in your custom source is append-only, with no updates to existing data. If your custom source allows updates, the CSV dump will go stale and you will need a way to keep it fresh before doing analytics. An alternative is to explore whether a JDBC-ODBC bridge can be employed to connect Spark SQL directly to your custom source, obviating the need to dump data to CSV at all.
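
If it helps, here is a minimal sketch of what the CSV route could look like from SparkCLR. The path, app name, and query are placeholders, and it assumes the spark-csv package is available on the cluster and that SparkCLR's SqlContext exposes the Spark 1.x style DataFrameReader (Read().Format(...).Load(...)); treat it as a starting point rather than tested code.

    using System.Collections.Generic;
    using Microsoft.Spark.CSharp.Core;
    using Microsoft.Spark.CSharp.Sql;

    var sparkContext = new SparkContext(new SparkConf().SetAppName("HistoryCsvLoader"));
    var sqlContext = new SqlContext(sparkContext);

    // "hdfs:///data/history/*.csv" is a placeholder for wherever the CSV dumps land.
    var historyDataFrame = sqlContext.Read()
                                     .Format("com.databricks.spark.csv")
                                     .Options(new Dictionary<string, string>
                                     {
                                         { "header", "true" },
                                         { "inferSchema", "true" }
                                     })
                                     .Load("hdfs:///data/history/*.csv");

    // Register the data for Spark SQL, matching the temp table name in your question.
    historyDataFrame.RegisterTempTable("mydata");

    // Any Spark SQL query now runs over the dumped files.
    var result = sqlContext.Sql("SELECT COUNT(*) FROM mydata");
    result.Show();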

skaarthik
  • What would be the best way to integrate the DataFrame created here with Spark so that it can be used in Zeppelin? Also, is it possible to append data to the RDD, i.e. add more people once the RDD is created? – Kiran Jan 06 '16 at 03:38
  • For incremental data, you just need to dump a new CSV file to the same location. The next time you load your RDD from that location, you will get the entire data set loaded into the RDD. Zeppelin support for SparkCLR is being investigated. For this scenario, you may not need the C# binding available in SparkCLR; you can use Zeppelin with Scala if you would like. – skaarthik Jan 06 '16 at 18:15