I am trying to use Spark SQL to read and select data from a JSON string.
Here is what I did:
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("aaa");
sparkConf.setMaster("local[*]");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
SparkSession sc = SparkSession.builder().sparkContext(javaSparkContext.sc()).getOrCreate();
String data = "{\"temp\":25, \"hum01\":50, \"env\":{\"lux\":1000, \"geo\":[32.5, 43.8]}}";
String querySql = "select env.lux as abc from testData";
System.out.println("start 01, time is"+System.currentTimeMillis());
List<String> dataList = Arrays.asList(data);
Dataset<String> dataset = sc.createDataset(dataList, Encoders.STRING());
dataset.printSchema();
System.out.println("start 02, time is"+System.currentTimeMillis());
Dataset<Row> df = sc.read().json(dataset);
System.out.println("start 03, time is"+System.currentTimeMillis());
List<String> queryResultJson = null;
try {
    df.createOrReplaceTempView("testData");
    System.out.println("start 04, time is"+System.currentTimeMillis());
    Dataset<Row> queryData = sc.sql(querySql);
    System.out.println("start 05, time is"+System.currentTimeMillis());
    queryResultJson = queryData.toJSON().collectAsList();
    System.out.println("start 06, time is"+System.currentTimeMillis());
} catch (Exception e) {
    e.printStackTrace();
} finally {
    sc.catalog().dropTempView("testData");
}
The output looks like this:
start 01, time is1543457455652
start 02, time is1543457458766
start 03, time is1543457459993
start 04, time is1543457460190
start 05, time is1543457460334
start 06, time is1543457460818
It seems the dataset creation steps (start 01 through start 03, roughly 4 seconds) take most of the time. I want to use this in a streaming data processing flow, but the performance is too poor for that.
Is there any way to make dataset creation faster? Or is there another method to query JSON data with a SQL-like language?
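One thing I am considering: since read().json has to scan the data to infer a schema, maybe supplying the schema up front would skip that step. An untested sketch of what I have in mind, using from_json with a hand-written StructType matching my sample record (the field types are my guesses):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

// Schema written by hand for {"temp":..., "hum01":..., "env":{"lux":..., "geo":[...]}}
// so Spark does not have to infer it from the data.
StructType schema = new StructType()
    .add("temp", DataTypes.LongType)
    .add("hum01", DataTypes.LongType)
    .add("env", new StructType()
        .add("lux", DataTypes.LongType)
        .add("geo", DataTypes.createArrayType(DataTypes.DoubleType)));

// dataset is the Dataset<String> from above; its single column is named "value".
// Parse each string with the fixed schema, then flatten the struct's fields.
Dataset<Row> df = dataset
    .select(from_json(col("value"), schema).as("j"))
    .select("j.*");

df.createOrReplaceTempView("testData");
Dataset<Row> queryData = sc.sql("select env.lux as abc from testData");
```

I have not measured whether this actually avoids the slow part, so I would also welcome other approaches.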