
Using rdd.map(lambda x: .....) in PySpark, I need to write a lambda function that formats a string.

For example, each row of a column contains a string such as "abcdefgh", and I want to insert "-" after every two characters to get "ab-cd-ef-gh".

How could I implement this with correct PySpark syntax, using code like the following:

df.rdd.map(lambda x: ((for i in range(10): x[i+2:2] + "-"),)).toDF()

1 Answer


There are some syntax errors in your map function. Try this:

sc = spark.sparkContext

rdd = sc.parallelize(["abcdefg", "hijklmno"])  
rdd.collect()
# Out: ['abcdefg', 'hijklmno']

rdd.map(lambda x: '-'.join([x[i:i+2] for i in range(0, len(x), 2)])).collect()
# Out:['ab-cd-ef-g', 'hi-jk-lm-no']
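Here range(0, len(x), 2) steps through the string two characters at a time, and the slice x[i:i+2] takes each pair; a trailing odd character simply comes through as a one-character chunk.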

Alternatively:

from itertools import zip_longest
rdd.map(lambda x: '-'.join(map(''.join, zip_longest(*[iter(x)]*2, fillvalue='')))) \
.collect()
# Out: ['ab-cd-ef-g', 'hi-jk-lm-no']
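The *[iter(x)]*2 trick passes the same iterator to zip_longest twice, so the string is consumed in pairs; fillvalue='' pads the final pair when the string has odd length.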

Or even shorter:

from textwrap import wrap
rdd.map(lambda x: '-'.join(wrap(x, 2))).collect()
# Out: ['ab-cd-ef-g', 'hi-jk-lm-no']
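Note that textwrap.wrap is designed for whitespace-aware line wrapping; it works as a plain fixed-width chunker here only because the input contains no spaces.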

(see Split string every nth character?)
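To get back to the DataFrame workflow from your question, here is a minimal sketch of the round trip through df.rdd and toDF(); the single string column named "value" is an assumption for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed example DataFrame with one string column named "value"
df = spark.createDataFrame([("abcdefgh",), ("hijklmno",)], ["value"])

# Apply the same join-over-slices transform row by row, wrapping each
# result in a 1-tuple so toDF() can infer a schema
df.rdd.map(
    lambda row: ('-'.join(row.value[i:i+2] for i in range(0, len(row.value), 2)),)
).toDF(["value"]).collect()
# Out: [Row(value='ab-cd-ef-gh'), Row(value='hi-jk-lm-no')]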
