
I am using deeplearning4j and trying to train a model on Spark. The training method I need to call is

    public MultiLayerNetwork fit(JavaRDD<DataSet> trainingData)

fit() needs a JavaRDD<DataSet> parameter, so I tried to build one like this:

    val totalDataset = csv.map(row => {
      val features = Array(
        row.getAs[String](0).toDouble, row.getAs[String](1).toDouble
      )
      val labels = Array(row.getAs[String](21).toDouble)
      val featuresINDA = Nd4j.create(features)
      val labelsINDA = Nd4j.create(labels)
      new DataSet(featuresINDA, labelsINDA)
    })

but IDEA reports the error No implicit arguments of type: Encoder[DataSet].
I don't know how to solve this error. I know a Spark RDD can be converted to a JavaRDD, but I don't know how to build a Spark RDD[DataSet].
DataSet here is org.nd4j.linalg.dataset.DataSet.
Its constructor is:

    public DataSet(INDArray first, INDArray second) {
        this(first, second, (INDArray)null, (INDArray)null);
    }

This is my code:

    val spark: SparkSession = SparkSession
      .builder()
      .master("local")
      .appName("Spark LSTM Emotion Analysis")
      .getOrCreate()
    import spark.implicits._
    val JavaSC = JavaSparkContext.fromSparkContext(spark.sparkContext)

    val csv=spark.read.format("csv")
      .option("header","true")
      .option("sep",",")
      .load("/home/hadoop/sparkjobs/LReg/data.csv")

    val totalDataset = csv.map(row => {
      val features = Array(
        row.getAs[String](0).toDouble, row.getAs[String](1).toDouble
      )
      val labels = Array(row.getAs[String](21).toDouble)
      val featuresINDA = Nd4j.create(features)
      val labelsINDA = Nd4j.create(labels)
      new DataSet(featuresINDA, labelsINDA)
    })

    val data = totalDataset.toJavaRDD

The deeplearning4j official guide creates a JavaRDD<DataSet> in Java like this:

    String filePath = "hdfs:///your/path/some_csv_file.csv";
    JavaSparkContext sc = new JavaSparkContext();
    JavaRDD<String> rddString = sc.textFile(filePath);
    RecordReader recordReader = new CSVRecordReader(',');
    JavaRDD<List<Writable>> rddWritables = rddString.map(new StringToWritablesFunction(recordReader));

    int labelIndex = 5;         //Labels: a single integer representing the class index in column number 5
    int numLabelClasses = 10;   //10 classes for the label
    JavaRDD<DataSet> rddDataSetClassification = rddWritables.map(new DataVecDataSetFunction(labelIndex, numLabelClasses, false));

I tried to create it in Scala:

    val JavaSC: JavaSparkContext = new JavaSparkContext()
    val rddString: JavaRDD[String] = JavaSC.textFile("/home/hadoop/sparkjobs/LReg/hf-data.csv")
    val recordReader: CSVRecordReader = new CSVRecordReader(',')
    val rddWritables: JavaRDD[List[Writable]] = rddString.map(new StringToWritablesFunction(recordReader))
    val featureColnum = 3
    val labelColnum = 1
    val d = new DataVecDataSetFunction(featureColnum,labelColnum,true,null,null)
    // val rddDataSet: JavaRDD[DataSet] = rddWritables.map(new DataVecDataSetFunction(featureColnum, labelColnum, true, null, null))
    // cannot resolve overloaded method 'map'

The debug error information is in the attached screenshot.


1 Answer


A DataSet is just a pair of INDArrays (inputs and labels). Our docs cover this in depth: https://deeplearning4j.konduit.ai/distributed-deep-learning/data-howto
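
As a minimal sketch (assuming two feature values and one label value, mirroring the question), a single example is just two INDArrays wrapped in a DataSet:

    import org.nd4j.linalg.dataset.DataSet
    import org.nd4j.linalg.factory.Nd4j

    // one example: a row of features and a row of labels
    val features = Nd4j.create(Array(0.5, 1.5))
    val labels   = Nd4j.create(Array(1.0))
    val example  = new DataSet(features, labels)

Building the Spark pipeline is then just a matter of producing one of these per record inside a map.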

For Stack Overflow's sake, I'll summarize what's there, since there is no single way to create a data pipeline; it depends on your problem. It's very similar to how you would create a dataset locally: generally you take whatever you do locally and put it into Spark inside a function.

CSVs and images, for example, are going to be very different, but generally you use the DataVec library for this. The docs summarize the approach for each kind of data.
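
For a CSV, a minimal Scala sketch of the DataVec route from the guide could look like the following (it reuses the JavaSC context and file path from the question; the two-class label count is an assumption). Note that StringToWritablesFunction returns a java.util.List[Writable], not a Scala List, which is the likely reason the overloaded map call could not be resolved in the question's last snippet:

    import java.util.{List => JList}
    import org.apache.spark.api.java.JavaRDD
    import org.datavec.api.records.reader.impl.csv.CSVRecordReader
    import org.datavec.api.writable.Writable
    import org.datavec.spark.transform.misc.StringToWritablesFunction
    import org.deeplearning4j.spark.datavec.DataVecDataSetFunction
    import org.nd4j.linalg.dataset.DataSet

    val rddString: JavaRDD[String] = JavaSC.textFile("/home/hadoop/sparkjobs/LReg/data.csv")
    // default delimiter is a comma; use new CSVRecordReader(1) to skip a header line
    val recordReader = new CSVRecordReader()
    // java.util.List, not scala.List
    val rddWritables: JavaRDD[JList[Writable]] =
      rddString.map(new StringToWritablesFunction(recordReader))

    // label in column 21 (as in the question), assumed 2 classes, regression = false
    val rddDataSet: JavaRDD[DataSet] =
      rddWritables.map(new DataVecDataSetFunction(21, 2, false))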

Edit: for future reference, the user's error here appears to be a mismatched Spark version. Unfortunately I was never notified of the response after the edit, so I wasn't able to reply to it.

Regarding the recent comment, my answer still stands. You create data pipelines using map functions. There are different ways of doing so, but it depends on whether you have a CSV, images, or something else.
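
Applied to the question's DataFrame, one way to sidestep the No implicit arguments of type: Encoder[DataSet] error is to drop to the underlying RDD[Row] before mapping, since plain RDDs do not need Spark SQL Encoders. A sketch under the question's column layout (columns 0 and 1 as features, column 21 as the label), reusing the csv DataFrame from the question:

    import org.apache.spark.api.java.JavaRDD
    import org.nd4j.linalg.dataset.DataSet
    import org.nd4j.linalg.factory.Nd4j

    // map over RDD[Row] instead of Dataset[Row]; no Encoder[DataSet] is required
    val data: JavaRDD[DataSet] = csv.rdd.map { row =>
      val features = Array(row.getAs[String](0).toDouble, row.getAs[String](1).toDouble)
      val labels   = Array(row.getAs[String](21).toDouble)
      new DataSet(Nd4j.create(features), Nd4j.create(labels))
    }.toJavaRDD()

The resulting JavaRDD[DataSet] can then be passed to the fit() method from the question.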

  • Thank you for your reply, but I only found the Java implementation in your documentation. Can you tell me how to implement it in Scala? This is my code to read the content of the CSV file. – WaterEast Jul 20 '20 at 02:59
  • Scala can use Java classes. There isn't that much of a difference. Unfortunately, we don't have a ton of Scala examples (and we purged a lot of our older examples). You are welcome to contribute some if you would like; otherwise you will have to map the Java to Scala yourself. – Adam Gibson Jul 20 '20 at 05:33
  • I tried to use Scala to port the Java code in the documentation, but on the last line it says that the overloaded map method cannot be resolved. I commented out this line and debugged it, but it produced an error. I re-edited the specific content into the original question. I hope to get your reply. Thanks again. – WaterEast Jul 20 '20 at 08:10
  • Broken link. This is why link-only answers are discouraged. – Wheezil Jun 07 '23 at 16:25
  • Hi, did you have a specific question? The answer actually already mentioned the question was too vague, since there's more than one way to create a pipeline. An RDD of DataSet is a map function over a lambda which converts the raw data to a DataSet. I mentioned the data pipeline there, but the overall premise is the same. – Adam Gibson Jun 07 '23 at 22:39