I'm trying to extend my RDD table by one column (with string values) using the answers to this question, but I cannot add a column name this way. I'm using Scala.
Is there an easy way to add a column to an RDD?
Apache Spark takes a functional approach to data processing. Fundamentally, an RDD[T] is a distributed collection of objects of type T (RDD stands for Resilient Distributed Dataset).
Following the functional approach, you process the objects inside an RDD using transformations. A transformation constructs a new RDD from an existing one; the original RDD is never modified.
One example of a transformation is the map method. Using map, you can transform each object of your RDD into an object of any other type you need. So, if you have a data structure that represents a row, you can transform it into a new one with an added column.
For example, take the following piece of code.
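import org.apache.spark.rdd.RDD

// sc is an existing SparkContext (for example, the one provided by spark-shell)
val rdd: RDD[(String, String)] = sc.parallelize(List(("Hello", "World"), ("Such", "Wow")))
// This new RDD will have one more "column",
// which is the concatenation of the previous two
val rddWithOneMoreColumn: RDD[(String, String, String)] =
  rdd.map {
    case (a, b) =>
      (a, b, a + b)
  }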
In this example, an RDD of Tuple2 (a.k.a. a pair) is transformed into an RDD of Tuple3 simply by applying a function to each element of the RDD.
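Tuples have no field names, which is probably why the column name gets lost. One possible way around this, sketched below under some assumptions, is to map into a case class instead of a tuple, so that the field names act as column names. The names Greeting, GreetingWithJoined, addJoined and NamedColumnsSketch are hypothetical and only serve the illustration; the local SparkSession is also an assumption (in spark-shell, spark and sc already exist).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical row types: the field names play the role of column names.
case class Greeting(first: String, second: String)
case class GreetingWithJoined(first: String, second: String, joined: String)

object NamedColumnsSketch {
  // Pure function that builds the row with the extra, named field.
  def addJoined(g: Greeting): GreetingWithJoined =
    GreetingWithJoined(g.first, g.second, g.first + g.second)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("named-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(
      List(Greeting("Hello", "World"), Greeting("Such", "Wow")))

    // Same idea as the tuple version: map each row to a wider row.
    val withJoined = rdd.map(addJoined)

    // Optional: as a DataFrame, the case class fields become column names.
    val df = withJoined.toDF()
    df.printSchema()

    spark.stop()
  }
}
```

If you end up working mostly with named columns, it may be simpler to stay with the DataFrame API from the start and use withColumn, but that is a different abstraction from the plain RDD discussed here.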
Of course, you have to apply an action to rddWithOneMoreColumn to make the computation happen, because Apache Spark evaluates all of your transformations lazily.
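To make the difference between a transformation and an action concrete, here is a minimal sketch that uses collect as the action. The object name LazyEvaluationSketch, the run helper and the local master setting are assumptions made for the illustration; in spark-shell you would use the sc that is already provided.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Builds the three-column RDD and forces it with the collect action.
  def run(sc: SparkContext): Array[(String, String, String)] = {
    val rdd = sc.parallelize(List(("Hello", "World"), ("Such", "Wow")))

    // Transformation only: nothing is computed yet, Spark just records the lineage.
    val rddWithOneMoreColumn = rdd.map { case (a, b) => (a, b, a + b) }

    // Action: triggers the job and returns the rows to the driver.
    rddWithOneMoreColumn.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("lazy-evaluation-sketch")
    val sc = new SparkContext(conf)
    try run(sc).foreach(println)
    finally sc.stop()
  }
}
```

Note that collect brings every row back to the driver, so it only suits small results; for larger data, actions such as count, take or saveAsTextFile are usually more appropriate.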