- If I have an RDD, how do I tell whether the data is in key:value format? Is there a way to find out - something like how `type(object)` tells me an object's type? I tried `print type(rdd.take(1))`, but it just says `<type 'list'>`.
- Let's say I have data like `(x,1),(x,2),(y,1),(y,3)` and I use `groupByKey` and get `(x,(1,2)),(y,(1,3))`. Is there a way to define `(1,2)` and `(1,3)` as values where x and y are the keys? Or does a key have to be a single value? I noted that if I use `reduceByKey` with a `sum` function to get the data `((x,3),(y,4))`, then it becomes much easier to define this data as a key-value pair (a minimal sketch of both cases follows).
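A minimal sketch of the two cases described above (assuming an existing `SparkContext` named `sc`; the order of `collect` output may vary):

```python
rdd = sc.parallelize([("x", 1), ("x", 2), ("y", 1), ("y", 3)])

# groupByKey keeps every value for a key, as an iterable
rdd.groupByKey().mapValues(list).collect()
## [('x', [1, 2]), ('y', [1, 3])]

# reduceByKey folds the values for each key into a single value
rdd.reduceByKey(lambda a, b: a + b).collect()
## [('x', 3), ('y', 4)]
```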

- 1. `rdd.first()` 2. Please clarify. `groupByKey` is usually for cases where you really eventually need the entire list. – Tom Ron Feb 29 '16 at 15:39
- 1. Wouldn't `rdd.first()` return me just the first datapoint? I want to know whether the data is in a key-value format or not. 2. Yes, I have used `groupByKey` to get the entire data, but I want it in key-value format – user2543622 Feb 29 '16 at 15:56
- You want it as a map? What about `collectAsMap`? Taking the first you will get a tuple - what do you mean by key-value format? What kind of type do you expect? – Tom Ron Mar 01 '16 at 09:53
- I couldn't find a good simple source on `collectAsMap`. Please share if you have anything. Would it be possible to provide a simple example? – user2543622 Mar 01 '16 at 13:53
- See `collectAsMap` here - http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD - `rdd.collectAsMap()` – Tom Ron Mar 01 '16 at 13:57
- Thanks. So if I run `groupByKey` on my original data and then run `collectAsMap()`, would it convert my data to dictionary format even if my second elements are lists (and not a single value)? – user2543622 Mar 01 '16 at 14:20
- Just try it yourself. The output would roughly be `{"a": [1, 2, 3], "b": [4], ...}` – Tom Ron Mar 01 '16 at 14:48
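A sketch of the `collectAsMap` route discussed above (assuming an existing `SparkContext` named `sc`; `mapValues(list)` turns the grouped values into plain lists rather than `ResultIterable` objects):

```python
rdd = sc.parallelize([("x", 1), ("x", 2), ("y", 1), ("y", 3)])

# collectAsMap brings the pairs back to the driver as a Python dict
rdd.groupByKey().mapValues(list).collectAsMap()
## {'x': [1, 2], 'y': [1, 3]}
```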
1 Answer
Python is a dynamically typed language and PySpark doesn't use any special type for key-value pairs. The only requirement for an object to be considered valid data for `PairRDD` operations is that it can be unpacked as follows:

```python
k, v = kv
```

Typically you would use a two-element `tuple` due to its semantics (an immutable object of fixed size) and its similarity to Scala `Product` classes. But this is just a convention and nothing stops you from doing something like this:
`key_value.py`:

```python
class KeyValue(object):
    def __init__(self, k, v):
        self.k = k
        self.v = v

    def __iter__(self):
        # Yield exactly two elements so that `k, v = kv` unpacking works
        for x in [self.k, self.v]:
            yield x
```

```python
from operator import add  # needed for reduceByKey(add)
from key_value import KeyValue

rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])

rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]
```
and make an arbitrary class behave like a key-value pair. So once again: if something can be correctly unpacked as a pair of objects, then it is a valid key-value. Implementing the `__len__` and `__getitem__` magic methods should work as well. Probably the most elegant way to handle this is to use `namedtuples`.
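For example, a minimal sketch of the `namedtuple` approach (assuming, as above, an existing `SparkContext` named `sc`):

```python
from collections import namedtuple
from operator import add

# A namedtuple is still a tuple, so it unpacks as a key-value pair
KV = namedtuple("KV", ["k", "v"])

rdd = sc.parallelize([KV("foo", 1), KV("foo", 2), KV("bar", 0)])
rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]
```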
Also, `rdd.take(n)` returns a `list` of length `n`, so its type will always be the same no matter what the RDD contains.
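To check the data itself you have to look at an element of that list instead, along these lines (a sketch; `rdd` is assumed to hold pairs):

```python
first = rdd.first()   # equivalently rdd.take(1)[0]
print(type(first))    # e.g. <type 'tuple'> for a pair RDD

# If this unpacking succeeds, the element works as a key-value pair:
k, v = first
```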

- I am learning from you. But I am still confused about something. Let's say for whatever reason I used `groupByKey` and will get `[('bar', (0)), ('foo', (1,2))]`... now can I use something like `rdd.map(lambda x: (x[0], len(x[1])))`? I know the same can be done using `countByKey`, but I want to use `groupByKey` – user2543622 Feb 29 '16 at 16:40
- `(0)` is not a valid `tuple` literal. It is just `0`. Otherwise, exactly like this. – zero323 Feb 29 '16 at 17:00
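A sketch of that pattern (assuming an existing `SparkContext` named `sc`; output order after a shuffle may vary):

```python
grouped = sc.parallelize(
    [("bar", 0), ("foo", 1), ("foo", 2)]).groupByKey()

# Count the grouped values per key; mapValues is slightly more
# idiomatic than map here because it leaves the keys untouched
grouped.mapValues(len).collect()
## [('bar', 1), ('foo', 2)]
```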