
I want to apply a preprocessing phase to a large amount of text data in Spark-Scala, such as lemmatization, stop word removal (using TF-IDF), and POS tagging. Is there any way to implement these in Spark-Scala?

for example here is one sample of my data:

The perfect fit for my iPod photo. Great sound for a great price. I use it everywhere. it is very usefulness for me.

after preprocessing:

perfect fit iPod photo great sound great price use everywhere very useful

and the tokens have POS tags, e.g. (iPod, NN) (photo, NN)

There is a POS tagger (sista.arizona); is it applicable in Spark?

Esmaeil zahedi

3 Answers

12

Anything is possible. The question is what YOUR preferred way of doing this would be.

For example, do you have a stop word dictionary that works for you (it could simply be a Set), or would you want to run TF-IDF to pick the stop words automatically? (Note that this requires some supervision, such as picking the threshold at which a word is considered a stop word.) You can provide the dictionary yourself, and Spark's MLlib already comes with TF-IDF.
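For the TF-IDF part, a minimal sketch using MLlib's HashingTF and IDF (this follows the example in the Spark documentation; the input path and the idea of thresholding scores to pick stop words are assumptions on my side, and `sc` is your SparkContext):

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Each document as a sequence of tokens (tokenize/lower-case upstream as you prefer).
val documents: RDD[Seq[String]] = sc.textFile("hdfs://.../reviews.txt").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()

val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
// Terms whose TF-IDF scores stay below a threshold you choose could then be treated as stop words.
```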

The POS tagging step is tricky. Most NLP libraries on the JVM (e.g. Stanford CoreNLP) don't implement java.io.Serializable, but you can still perform the map step using them, e.g.

myRdd.map(functionToEmitPOSTags)

On the other hand, don't emit an RDD that contains non-serializable classes from that NLP library, since steps such as collect(), saveAsNewAPIHadoopFile, etc. will fail. Also to reduce headaches with serialization, use Kryo instead of the default Java serialization. There are numerous posts about this issue if you google around, but see here and here.
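For the Kryo part, the configuration is roughly this (a sketch; the app name is arbitrary and you would register whichever classes actually travel through your RDDs):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("text-preprocessing")
  // Switch from the default Java serialization to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Register the classes you actually ship around, e.g. token arrays.
conf.registerKryoClasses(Array[Class[_]](classOf[Array[String]]))
```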

Once you figure out the serialization issues, you need to figure out which NLP library to use to generate the POS tags. There are plenty of those, e.g. Stanford CoreNLP, LingPipe and Mallet for Java, Epic for Scala, etc. Note that you can of course use the Java NLP libraries with Scala, including with wrappers such as the University of Arizona's Sista wrapper around Stanford CoreNLP, etc.
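If you go with Stanford CoreNLP, one common way around the serialization problem is to construct the pipeline inside mapPartitions, so the non-serializable annotator objects are created on the executors and never shipped. A sketch, assuming myRdd is an RDD[String] with one document per element:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations

val posTagged = myRdd.mapPartitions { docs =>
  // Build the pipeline once per partition; it is never serialized.
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, pos")
  val pipeline = new StanfordCoreNLP(props)

  docs.map { text =>
    val annotation = new Annotation(text)
    pipeline.annotate(annotation)
    // Emit only plain (word, tag) pairs, which are serializable.
    annotation.get(classOf[CoreAnnotations.TokensAnnotation]).asScala
      .map(token => (token.word(), token.tag()))
      .toList
  }
}
```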

Also, why didn't your example lower-case the processed text? That's pretty much the first thing I would do. If you have special cases such as iPod, you could apply the lower-casing except in those cases. In general, though, I would lower-case everything. If you're removing punctuation, you should probably first split the text into sentences (split on the period using regex, etc.). If you're removing punctuation in general, that can of course be done using regex.
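As a rough sketch of that cleanup step (the regexes are deliberately naive; adjust them, and whitelist tokens like "iPod" if you want to preserve their case):

```scala
// Assumes myRdd: RDD[String], one review per element.
val cleaned = myRdd.map { text =>
  text.toLowerCase
    .split("""(?<=[.!?])\s+""")               // naive sentence split on end-of-sentence punctuation
    .map(_.replaceAll("""\p{Punct}""", ""))   // then strip remaining punctuation within each sentence
    .mkString(" ")
}
```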

How deeply do you want to stem? For example, the Porter stemmer (there are implementations in every NLP library) stems so deeply that "universe" and "university" become the same resulting stem. Do you really want that? There are less aggressive stemmers out there, depending on your use case. Also, why use stemming if you can use lemmatization, i.e. splitting the word into the grammatical prefix, root and suffix (e.g. walked = walk (root) + ed (suffix)). The roots would then give you better results than stems in most cases. Most NLP libraries that I mentioned above do that.
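Lemmatization with CoreNLP follows the same mapPartitions pattern as the POS example above; only the annotator list and the annotation you read change. A sketch (again assuming myRdd: RDD[String]):

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations

val lemmatized = myRdd.mapPartitions { docs =>
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, pos, lemma") // lemma requires pos
  val pipeline = new StanfordCoreNLP(props)

  docs.map { text =>
    val annotation = new Annotation(text)
    pipeline.annotate(annotation)
    annotation.get(classOf[CoreAnnotations.TokensAnnotation]).asScala
      .map(_.get(classOf[CoreAnnotations.LemmaAnnotation]))     // the lemma for each token
      .toList
  }
}
```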

Also, what's your distinction between a stop word and a non-useful word? For example, you removed the pronoun in the subject form "I" and the possessive form "my," but not the object form "me." I recommend picking up an NLP textbook like "Speech and Language Processing" by Jurafsky and Martin (for the ambitious), or just reading one of the engineering-centered books about NLP tools such as LingPipe for Java, NLTK for Python, etc., to get a good overview of the terminology, the steps in an NLP pipeline, etc.

marekinfo
  • I gave you a whole list of tools (e.g. how to serialize NLP libraries on Spark), algorithms (lemmatization vs. stemming), and ideas to try out. Also, was it not OK for me to ask a question? You asked a question so I wanted to clarify. And yes, "sport" can be treated as a stop word if e.g. TF-IDF detects that it's a high-frequency word but not specific to a document. Why wasn't it useful? What did I miss? If you were looking for actual code then as I said, it's case-dependent (e.g. Porter stemming may be too deep, you may choose competing libraries like CoreNLP vs. Epic, etc.). – marekinfo Apr 28 '15 at 14:12
  • I know the whole list of tools and algorithms that you have mentioned, but I want to know the simplest, most useful, and most applicable ways. Thanks. – Esmaeil zahedi Apr 28 '15 at 14:38
  • As I mentioned, the simplest and most applicable ways depend on your use case - for example, the depth of stemming (e.g. Porter vs. snowball), or choosing stemming vs. lemmatization, depends on your problem. There is no universally ideal approach. What's the exact problem that you're trying to solve? POS tagging etc. is "busywork," it's a step in the process, not an end in itself. Also, "ways" or fully spelled-out code? – marekinfo Apr 28 '15 at 14:41
  • 5
    @Esmaeilzahedi "thank you for your reply , but it was not helpful" - that's quite rude considering the amount of information you've been provided on a question that shows absolutely no research effort. – maasg Apr 28 '15 at 14:43
  • I want to extract nouns and adjectives as (unigrams, bigrams, trigrams, n-grams) from a large number of documents. However, before doing this I should apply preprocessing to the text; I prefer lemmatization, stop word removal (using TF-IDF), and simple POS tagging. – Esmaeil zahedi Apr 28 '15 at 14:50
  • Well then, as I mentioned, the TF-IDF example has already been posted on the Spark [documentation website](https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html#tf-idf). The Stanford Core NLP lemmatization example is [here](http://stackoverflow.com/questions/1578062/lemmatization-java) in Java, so it should be a trivial port to Scala. – marekinfo Apr 28 '15 at 14:59
1

There is no built-in NLP capability in Apache Spark. You would have to implement it yourself, perhaps based on a non-distributed NLP library, as described in marekinfo's excellent answer.

Daniel Darabos
1

I would suggest you take a look at Spark's ML Pipeline. You may not get everything out of the box yet, but you can build your own capabilities and use the Pipeline as a framework.
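For example, a minimal sketch of such a pipeline combining tokenization, stop word removal and TF-IDF (the DataFrame `reviews` with a "text" column and all column names are assumptions on my side; a POS-tagging stage would still need a custom Transformer wrapping an NLP library):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}

// reviews: DataFrame with a "text" column holding the raw documents.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val remover   = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, hashingTF, idf))
val model    = pipeline.fit(reviews)      // fits the IDF statistics on your corpus
val features = model.transform(reviews)   // adds the tokenized/filtered/TF-IDF columns
```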

ayan guha