I need to retain only the rows of a Spark Dataset whose "manufacturer" column value is present in an ArrayList.
The complete dataset is:
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"west chester","I wish Java could use case classes",
"west chester","I wish Java could use case classes",
"west chester","I wish Java could use case classes",
"wells lamont","Logistic,regression,models,are,neat";
After applying the filter based on the ArrayList, I need:
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"wells lamont","Logistic,regression,models,are,neat";
I'm trying the code below, but I can't work out how to go further.
try {
    // Hadoop home directory (location of winutils on Windows).
    System.setProperty("hadoop.home.dir", "C:\\AD_classfication\\Apachespark\\winutil");
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
    SQLContext sqlContext = new SQLContext(sc);
    SparkSession spark = SparkSession.builder()
            .appName("JavaTokenizerExample")
            .getOrCreate();

    // Sample rows: (manufacturer, sentence).
    List<Row> data = Arrays.asList(
            RowFactory.create("weiler", "Hi I heard about Spark"),
            RowFactory.create("weiler", "Hi I heard about Spark"),
            RowFactory.create("weiler", "Hi I heard about Spark"),
            RowFactory.create("west chester", "I wish Java could use case classes"),
            RowFactory.create("west chester", "I wish Java could use case classes"),
            RowFactory.create("west chester", "I wish Java could use case classes"),
            RowFactory.create("wells lamont", "Logistic,regression,models,are,neat")
    );

    // Both columns hold strings.
    StructType schema = new StructType(new StructField[] {
            new StructField("manufacturer", DataTypes.StringType, false, Metadata.empty()),
            new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });

    // Manufacturers to keep.
    ArrayList<String> uniqueManufacturer = new ArrayList<String>();
    uniqueManufacturer.add("weiler");
    uniqueManufacturer.add("wells lamont");

    Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
    // Not sure how to go further from here: this should keep only the
    // manufacturers listed in uniqueManufacturer.
    List<Row> distinctManufacturerNamesList=sentenceDataFrame.filter("manufacturer");
    sentenceDataFrame.show();
} catch (Exception e) {
    e.printStackTrace();
}
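For reference, here is a minimal sketch of the result I am after, assuming Spark 2.x or later with the Java API. It uses `Column.isin`, which takes `Object` varargs, so the list is passed as an array. The class name `ManufacturerFilterSketch` and the helpers `keepManufacturers` and `sampleData` are hypothetical names made up for this illustration, and the `local[*]` master is only an assumption for running it on one machine.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

public class ManufacturerFilterSketch {

    // Hypothetical helper: keeps only rows whose "manufacturer" value is in `allowed`.
    static Dataset<Row> keepManufacturers(Dataset<Row> df, List<String> allowed) {
        // Column.isin accepts Object varargs, so the list is converted to an array.
        return df.filter(col("manufacturer").isin(allowed.toArray()));
    }

    // Hypothetical helper: builds the sample data from the question.
    static Dataset<Row> sampleData(SparkSession spark) {
        List<Row> data = Arrays.asList(
                RowFactory.create("weiler", "Hi I heard about Spark"),
                RowFactory.create("weiler", "Hi I heard about Spark"),
                RowFactory.create("weiler", "Hi I heard about Spark"),
                RowFactory.create("west chester", "I wish Java could use case classes"),
                RowFactory.create("west chester", "I wish Java could use case classes"),
                RowFactory.create("west chester", "I wish Java could use case classes"),
                RowFactory.create("wells lamont", "Logistic,regression,models,are,neat"));

        StructType schema = new StructType(new StructField[] {
                new StructField("manufacturer", DataTypes.StringType, false, Metadata.empty()),
                new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
        });
        return spark.createDataFrame(data, schema);
    }

    public static void main(String[] args) {
        // Assumption: local master, for trying this out on a single machine.
        SparkSession spark = SparkSession.builder()
                .appName("ManufacturerFilterSketch")
                .master("local[*]")
                .getOrCreate();

        List<String> uniqueManufacturer = Arrays.asList("weiler", "wells lamont");
        Dataset<Row> filtered = keepManufacturers(sampleData(spark), uniqueManufacturer);

        // Show full column contents without truncation.
        filtered.show(false);

        spark.stop();
    }
}
```

With the sample data this should leave the three "weiler" rows and the single "wells lamont" row. Note that `filter` returns a `Dataset<Row>`, not a `List<Row>`; calling `collectAsList()` on the result gives a `List<Row>` if one is needed. On Spark 2.4 and later, `Column.isInCollection` appears to be an alternative that accepts the list directly.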