
I'm a Spark newbie who has just been reading a few books. Is it possible to tokenize remote documents given only their URLs? For example, would there be a way to modify this script (taken from this blog post):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object TokenizerApp {
  def main(args: Array[String]) {
    val logFile = "src/data/sample.txt" // Should be some file on your system
    val sc = new SparkContext("local", "Tokenizer App", "/path/to/spark-0.9.1-incubating",
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val tokens = logData.flatMap(line => line.split(" ")) // reuse the cached RDD instead of reading the file again
    val termFrequency = tokens.map(word => (word, 1)).reduceByKey((a, b) => a + b)
    termFrequency.collect.foreach(tf => println("Term, Frequency: " + tf))
    tokens.saveAsTextFile("src/data/tokens")
    termFrequency.saveAsTextFile("src/data/term_frequency")
  }
}
sunny

1 Answer


Is it possible to tokenize remote documents with only their url?

Sure, it is possible. First choose your favorite HTTP library. Next write a small wrapper which takes a URL and returns the related content:

def getContent(url: String): String = ???
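
For illustration, here is a minimal sketch assuming plain scala.io.Source is enough for the fetching part (any HTTP client would do; error handling and timeouts are omitted):

import scala.io.Source

def getContent(url: String): String = {
  val source = Source.fromURL(url) // open the URL and read the response as text
  try source.mkString finally source.close()
}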

Next create an RDD of URLs:

import org.apache.spark.rdd.RDD

val urls: RDD[String] = ???
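
For instance, the URLs could come from a simple hard-coded list (the addresses below are just hypothetical placeholders):

val urls: RDD[String] = sc.parallelize(Seq(
  "http://example.com/doc1.html",
  "http://example.com/doc2.html"
))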

Map over it using the helper function and cache the result:

val contents: RDD[(String, String)] = urls.map(url => (url, getContent(url)))
contents.cache

Convert to a DataFrame so we can use the built-in tokenizer:

import org.apache.spark.sql.DataFrame
// toDF needs the implicit conversions from a SQLContext; spark-shell provides one named sqlContext
import sqlContext.implicits._

val df: DataFrame = contents.toDF("url", "text")

Create a tokenizer:

import org.apache.spark.ml.feature.Tokenizer

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")

Tokenize:

val tokenized = tokenizer.transform(df)
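
You can then inspect the result (the exact output depends on the pages you fetched):

tokenized.printSchema()
tokenized.select("url", "tokens").show()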

Seriously though, don't. Using Spark for a job like this is a really bad idea, for more or less the same reason why it doesn't make sense to run a Spark job only for its side effects.

If you want to use Spark in a web-scraping pipeline, you can either use a scraper as an input stream source in Spark Streaming (which is probably overkill) or simply collect the data to persistent storage and process it with Spark afterwards, as sketched below.
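
As a rough sketch of the second option, assuming the scraped pages have already been dumped to storage as plain text files (the path below is hypothetical), you could do:

// Each element returned by wholeTextFiles is a (path, content) pair
val docs = sc.wholeTextFiles("/data/scraped").toDF("url", "text")

val tokenizedDocs = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")
  .transform(docs)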

zero323