I'm trying to write a custom function that takes an RDD, lowercases each record, splits it into characters, and then uses each record as the key in a key-value pair where the value is always 1. I've already written two other custom functions, to_lower() and to_characters(), that do the lowercasing and the character splitting, respectively.

I've tried a few different things, but so far I've only been able to get the entire list as the key instead of each record being in its own pair.

#Attempt 1
def rdd_to_character_value_pairs(rdd):
  lowerRDD = rdd.map(lambda x: to_lower(x))
  characterRDD = lowerRDD.map(lambda x: to_characters(x))
  pairedRDD = characterRDD.map(lambda x: ([char for char in characterRDD], 1))
  return pairedRDD

#Attempt 2
def rdd_to_character_value_pairs(rdd):
  lowerRDD = rdd.map(lambda x: to_lower(x))
  characterRDD = lowerRDD.map(lambda x: to_characters(x))
  for i in characterRDD.collect():
     return ([char for char in characterRDD], 1)
     #have also tried return (i,1)

I understand that you can't iterate over an RDD, but I haven't been able to get any of the workarounds to work either.

  • Welcome to SO! Check out the [tour]. What's your question exactly? I'm not familiar with RDDs so I'm not sure if you're looking for something specific to them or something more generic like [building a dict from a list of keys with all the same value](https://stackoverflow.com/q/11977730/4518341). In any case, it would also help to provide a [mre] with example input, desired output, and actual output, and to remove the stuff about `to_lower` and `to_characters` since it doesn't seem relevant to the question. You can [edit] your post. – wjandrea Feb 28 '21 at 15:14
  • BTW, `lambda x: func(x)` is redundant. Just use `func` instead. I believe this is called *eta-reduction*. – wjandrea Feb 28 '21 at 15:17
  • https://stackoverflow.com/questions/36708338/pyspark-pipelinedrdd-object-is-not-iterable – tevemadar Mar 09 '21 at 08:53

1 Answer

some_list = ["value1", "value2"]
some_dict = {value_as_key: 'some_value' for value_as_key in some_list}

Output:

{'value1': 'some_value', 'value2': 'some_value'}
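
The comprehension above works on a plain list. For the RDD case in the question, one possible approach is to flatten the per-record character lists before pairing, so that each character ends up in its own pair. The sketch below models that with ordinary Python iterables rather than Spark. The bodies of `to_lower` and `to_characters` are hypothetical stand-ins, since the question doesn't show them, and `to_character_value_pairs` is an illustrative name.

```python
from itertools import chain

# Hypothetical stand-ins for the asker's helpers; their real bodies are not shown.
def to_lower(record):
    return record.lower()

def to_characters(record):
    return list(record)

def to_character_value_pairs(records):
    """Plain-Python model of a map -> flatten -> pair pipeline."""
    lowered = map(to_lower, records)
    # Flatten: yield one element per character instead of one list per record.
    characters = chain.from_iterable(map(to_characters, lowered))
    return [(char, 1) for char in characters]

pairs = to_character_value_pairs(["Ab", "c"])
print(pairs)  # [('a', 1), ('b', 1), ('c', 1)]

# On an actual RDD the analogous chain would likely be (untested here, and
# assuming to_characters returns a list of characters):
#   rdd.map(to_lower).flatMap(to_characters).map(lambda char: (char, 1))
```

The difference from the attempts in the question is the flattening step: `map` produces exactly one output per record, so a record's whole character list becomes a single key, whereas `flatMap` emits each element of the returned list as a separate record.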
David Meu