I have an Azure system divided into three parts:
- Azure Data Lake Storage, where I have a CSV file.
- Azure Databricks, where I need to do some processing: specifically, convert that CSV file to Redis hash format.
- Azure Cache for Redis, where I should put the converted data.
After mounting the storage in the Databricks filesystem, the data needs to be processed. How do I convert the CSV data located in the Databricks filesystem to Redis hash format and put it into Redis correctly? Specifically, I'm not sure how to write a correct mapping in the code below. Or is there perhaps some intermediate step through a SQL table that I'm missing?
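To make the goal concrete: I want each row of the CSV to end up as field/value data in a Redis hash. A rough illustration of the shape I have in mind, written as a plain Scala map (the key and field names are just the ones from my file; whether every row should become its own hash, or everything should go into a single hash, is part of what I'm unsure about):
// Illustrative only: one CSV row expressed as hash fields, with Prop1 assumed to be the key.
val oneRowAsHash: Map[String, String] = Map(
  "Prop2" -> "value2",
  "Prop3" -> "value3",
  "Prop4" -> "value4",
  "Prop5" -> "value5"
)
// In redis-cli terms: HSET data:<Prop1 value> Prop2 value2 Prop3 value3 Prop4 value4 Prop5 value5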
Here is my example code, written in Scala:
import com.redislabs.provider.redis._
val redisServerDnsAddress = "HOST"
val redisPortNumber = 6379
val redisPassword = "Password"
val redisConfig = new RedisConfig(new RedisEndpoint(redisServerDnsAddress, redisPortNumber, redisPassword))
val data = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/mnt/staging/data/file.csv")
// What is the right way of mapping?
val ds = table("data").select("Prop1", "Prop2", "Prop3", "Prop4", "Prop5").distinct.na.drop().map { x =>
  (x.getString(0), x.getString(1), x.getString(2), x.getString(3), x.getString(4))
}
sc.toRedisHASH(ds, "data")
The error:
error: type mismatch;
found : org.apache.spark.sql.Dataset[(String, String)]
required: org.apache.spark.rdd.RDD[(String, String)]
sc.toRedisHASH(ds, "data")
If I write the last line of code this way:
sc.toRedisHASH(ds.rdd, "data")
The error:
org.apache.spark.sql.AnalysisException: Table or view not found: data;
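Based on the first error, toRedisHASH apparently wants an RDD[(String, String)] plus a hash name, so my best guess so far is a mapping like the sketch below, built directly from the data DataFrame instead of the table("data") lookup. But this only fits two of the five columns into a (field, value) pair, so I doubt it is the intended approach:
// Guess, not a verified solution: build (field, value) pairs straight from the DataFrame
// loaded above, so there is no lookup of a registered table named "data".
// Prop1 is arbitrarily used as the hash field and Prop2 as its value; the other
// columns are simply dropped, which is exactly what I want to avoid.
val pairs = data
  .select("Prop1", "Prop2")
  .distinct
  .na.drop()
  .rdd
  .map(row => (row.getString(0), row.getString(1)))

sc.toRedisHASH(pairs, "data")
How should the mapping look so that all five columns end up in Redis correctly?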