I have a DataFrame that I want to loop through its rows and add its values to a list that can be used by the driver? Broadcast variables are read-only and as far as I know accumulators are only for sum.
Is there a way to do this? am using spark 1.6.1
Here is the code that runs on the worker nodes. I tried passing the List to the constructor, but it did not work as it seems once the code is streamed to the worker nodes it does not return any values to the driver.
public class EnrichmentIdentifiersBuilder implements Serializable{
/**
*
*/
private static final long serialVersionUID = 269187228897275370L;
private List<Map<String, String>> extractedIdentifiers;
public EnrichmentIdentifiersBuilder(List<Map<String, String>> extractedIdentifiers) {
//super();
this.extractedIdentifiers = extractedIdentifiers;
}
public void addIdentifiers(DataFrame identifiers)
{
final List<String> parameters=Arrays.asList(identifiers.schema().fieldNames());
identifiers.foreach(new MyFunction<Row, BoxedUnit>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public BoxedUnit apply(Row line)
{
for (int i = 0; i < parameters.size(); i++)
{
Map<String, String> identifier= new HashMap<>();
identifier.put(parameters.get(i), line.getString(i));
extractedIdentifiers.add(identifier);
}
return BoxedUnit.UNIT;
}
});
}
}