I am currently trying to process large volumes of text. As part of this process I want to do things such as tokenization and stemming. However, some of my steps require loading an external model (for example, the OpenNLP tokenizers). This is the approach I am trying at the moment:
SparkConf sparkConf = new SparkConf().setAppName("Spark Tokenizer");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(sparkContext);
DataFrame corpus = sqlContext.read().text("/home/zezke/document.nl");
// Create pipeline components
Tokenizer tokenizer = new Tokenizer()
        .setInputCol("value")
        .setOutputCol("tokens");
DataFrame tokenizedCorpus = tokenizer.transform(corpus);
// Save the output
tokenizedCorpus.write().mode(SaveMode.Overwrite).json("/home/zezke/experimentoutput");
The approach I am currently trying uses a custom UnaryTransformer:
public class Tokenizer extends UnaryTransformer<String, List<String>, Tokenizer> implements Serializable {

    private final static String uid = Tokenizer.class.getSimpleName() + "_" + UUID.randomUUID().toString();
    private static Map<String, String> stringReplaceMap;

    @Override
    public void validateInputType(DataType inputType) {
        assert (inputType.equals(DataTypes.StringType)) :
                String.format("Input type must be %s, but got %s", DataTypes.StringType.simpleString(), inputType.simpleString());
    }

    public Function1<String, List<String>> createTransformFunc() {
        Function1<String, List<String>> f = new TokenizerFunction();
        return f;
    }

    public DataType outputDataType() {
        return DataTypes.createArrayType(DataTypes.StringType, true);
    }

    public String uid() {
        return uid;
    }

    private class TokenizerFunction extends AbstractFunction1<String, List<String>> implements Serializable {

        public List<String> apply(String sentence) {
            // ... code goes here
        }
    }
}
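For reference, here is roughly what I had in mind for the elided apply body: lazily load the OpenNLP model the first time the function is used, so it is only initialized once per JVM. The model path and the getOrLoadTokenizer helper are placeholders I made up, not code I actually have, and the imports (opennlp.tools.tokenize.TokenizerME, opennlp.tools.tokenize.TokenizerModel, java.io.*, java.util.Arrays) are assumed:

private class TokenizerFunction extends AbstractFunction1<String, List<String>> implements Serializable {

    // transient so the model itself is not serialized with the closure;
    // it is rebuilt lazily wherever the function actually runs
    private transient TokenizerME opennlpTokenizer;

    private TokenizerME getOrLoadTokenizer() {
        if (opennlpTokenizer == null) {
            try (InputStream in = new FileInputStream("/home/zezke/nl-token.bin")) { // placeholder path
                opennlpTokenizer = new TokenizerME(new TokenizerModel(in));
            } catch (IOException e) {
                throw new RuntimeException("Failed to load OpenNLP tokenizer model", e);
            }
        }
        return opennlpTokenizer;
    }

    public List<String> apply(String sentence) {
        // TokenizerME.tokenize returns a String[], which matches the declared ArrayType(StringType) output
        return Arrays.asList(getOrLoadTokenizer().tokenize(sentence));
    }
}

I am not sure this is the right place to do the loading, which is what my first question below is about.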
Now my questions are:
- What is the best time to load the model? I don't want to load it multiple times.
- How do I distribute the model to the various nodes? (I sketched one idea below this list, but I am not sure it is the right approach.)
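For the second question, the idea I had is to read the model file once on the driver, broadcast the raw bytes, and rebuild the OpenNLP model from those bytes on the executors, roughly like this (the path is again a placeholder and exception handling is omitted):

// On the driver: read the model file once and broadcast the bytes to all nodes
byte[] modelBytes = Files.readAllBytes(Paths.get("/home/zezke/nl-token.bin"));
Broadcast<byte[]> broadcastModel = sparkContext.broadcast(modelBytes);

// On an executor (e.g. inside the transform function): rebuild the model from the broadcast bytes
TokenizerModel model = new TokenizerModel(new ByteArrayInputStream(broadcastModel.value()));
TokenizerME opennlpTokenizer = new TokenizerME(model);

Is broadcasting the right mechanism here, or is there a better way to ship the model alongside the transformer?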
Thanks in advance; Spark is a bit daunting to get into, but it looks promising.