Multiple filter on Dataset in Apache Spark

Question

I need to retain only those rows of a Spark Dataset whose value in the "manufacturer" column is present in an ArrayList.

The complete data set contains:

"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"west chester","I wish Java could use case classes",
"west chester","I wish Java could use case classes",
"west chester","I wish Java could use case classes",
"wells lamont","Logistic,regression,models,are,neat";

After applying the filter with the ArrayList, I need:

"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"wells lamont","Logistic,regression,models,are,neat";

I'm trying the code below, but I don't understand how to go further.

try {                
    System.setProperty("hadoop.home.dir", "C:\\AD_classfication\\Apachespark\\winutil");
    // A single SparkSession is enough here; it creates the underlying SparkContext itself.
    SparkSession spark = SparkSession.builder()
                                     .appName("JavaTokenizerExample")
                                     .master("local[*]")
                                     .getOrCreate();

    List<Row> data = Arrays.asList(
                          RowFactory.create("weiler", "Hi I heard about Spark"),
                          RowFactory.create("weiler", "Hi I heard about Spark"),
                          RowFactory.create("weiler", "Hi I heard about Spark"),
                          RowFactory.create("west chester","I wish Java could use case classes"),
                          RowFactory.create("west chester","I wish Java could use case classes"),
                          RowFactory.create("west chester","I wish Java could use case classes"),
                          RowFactory.create("wells lamont","Logistic,regression,models,are,neat")
                    );

    StructType schema = new StructType(new StructField[] {
                            new StructField("manufacturer", DataTypes.StringType, false,
                                      Metadata.empty()),
                            new StructField("sentence", DataTypes.StringType, false,
                                      Metadata.empty()) 
                        });

    ArrayList<String> uniqueManufacturer = new ArrayList<String>();
    uniqueManufacturer.add("weiler");
    uniqueManufacturer.add("wells lamont");


    Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
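    // filter(...) returns a Dataset<Row>, not a List<Row>; isin(...) keeps the rows
    // whose manufacturer equals any of the values in uniqueManufacturer.
    Dataset<Row> filteredDataFrame = sentenceDataFrame.filter(
            sentenceDataFrame.col("manufacturer").isin(uniqueManufacturer.toArray()));
    filteredDataFrame.show();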
} catch (Exception e) {
        e.printStackTrace();
}
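
The selection asked for here is a per-row membership test: keep a row when its manufacturer is in the list. As a cross-check of the expected result, the following is a minimal plain-Java sketch that runs without Spark. The class name `ManufacturerFilterSketch` and the helpers `keepListed` and `sampleRows` are made up for illustration and are not part of any Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class (not part of Spark): it only pins down which rows should survive.
class ManufacturerFilterSketch {

    // Returns the rows whose first field (manufacturer) is in the allowed list, in input order.
    static List<String[]> keepListed(List<String[]> rows, List<String> allowed) {
        Set<String> allowedSet = new HashSet<>(allowed); // hash set for fast membership checks
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (allowedSet.contains(row[0])) {
                kept.add(row);
            }
        }
        return kept;
    }

    // The seven sample rows from the question, as {manufacturer, sentence} pairs.
    static List<String[]> sampleRows() {
        return Arrays.asList(
            new String[] {"weiler", "Hi I heard about Spark"},
            new String[] {"weiler", "Hi I heard about Spark"},
            new String[] {"weiler", "Hi I heard about Spark"},
            new String[] {"west chester", "I wish Java could use case classes"},
            new String[] {"west chester", "I wish Java could use case classes"},
            new String[] {"west chester", "I wish Java could use case classes"},
            new String[] {"wells lamont", "Logistic,regression,models,are,neat"}
        );
    }

    public static void main(String[] args) {
        List<String> uniqueManufacturer = Arrays.asList("weiler", "wells lamont");
        // Prints the rows that pass the membership test, one per line.
        for (String[] row : keepListed(sampleRows(), uniqueManufacturer)) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

On the Spark side, the same per-row test is usually written with `Column.isin`, which takes the allowed values as varargs.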
