
My Spark program reads a file that contains a gzip-compressed, base64-encoded string. I have to decode and decompress it. I used Spark's unbase64 to decode it and generated a byte array:

bytedf = df.withColumn("unbase", unbase64(col("value")))

Is there any method available in Spark that decompresses the resulting byte array?

ranjith reddy

3 Answers


I wrote a UDF:

import base64
import zlib
from pyspark.sql.functions import udf

def decompress(ip):
    # decode the base64 string to gzip-compressed bytes
    bytecode = base64.b64decode(ip)
    # 32 + MAX_WBITS tells zlib to expect a gzip header
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    decompressed_data = d.decompress(bytecode)
    return decompressed_data.decode('utf-8')



decompress_udf = udf(decompress)
decompressedDF = df.withColumn("decompressed_XML", decompress_udf("value"))
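
To sanity-check the whole pipeline you can round-trip a tiny sample through it. A minimal sketch, assuming a SparkSession named spark; the column name "value" matches the question:

import base64
import gzip

# build one row whose "value" is a base64-encoded gzip payload
sample = base64.b64encode(gzip.compress(b"<note>hello</note>")).decode("ascii")
df = spark.createDataFrame([(sample,)], ["value"])

decompressedDF = df.withColumn("decompressed_XML", decompress_udf("value"))
# decompressedDF.first()["decompressed_XML"] == "<note>hello</note>"

Note the UDF receives the raw base64 string and does its own b64decode, so unbase64 is not needed with this approach.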
ranjith reddy

I have a similar case; in mine, I do this:

from pyspark.sql.functions import col, unbase64, udf
from gzip import decompress

# unbase64 turns the base64 string column into raw (binary) gzip bytes
bytedf = df1.withColumn("unbase", unbase64(col("payload")))

# plain-Python gunzip + UTF-8 decode, wrapped as a UDF
decompress_func = lambda x: decompress(x).decode('utf-8')
udf_decompress = udf(decompress_func)

df2 = bytedf.withColumn('unbase_decompress', udf_decompress('unbase'))
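
If the table is large, the same logic can be vectorized with a pandas UDF, which cuts per-row serialization overhead. This is a sketch rather than part of the original answer, assuming pyspark >= 3.0 with pyarrow installed and the same bytedf as above:

import pandas as pd
from gzip import decompress
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def gunzip(b: pd.Series) -> pd.Series:
    # each element is the binary gzip payload produced by unbase64
    return b.apply(lambda x: decompress(x).decode("utf-8"))

df2 = bytedf.withColumn("unbase_decompress", gunzip("unbase"))

Functionally this matches the plain UDF; only the execution strategy differs.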

Spark example using base64:

import base64
.
.
# decode the base64 string using a map operation, or you may create a udf.
df.map(lambda base64string: base64.b64decode(base64string), <string encoder>)
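
For completeness, in PySpark a DataFrame does not expose .map directly; you go through the underlying RDD. A minimal sketch of that route, assuming the encoded string sits in a column named "value" as in the question:

import base64

# each element of the result is raw (still gzip-compressed) bytes
decoded_rdd = df.rdd.map(lambda row: base64.b64decode(row["value"]))

After this you would still need to gunzip each element, e.g. with gzip.decompress, as the other answers show.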

Read here for a detailed Python example.

Rahul Sharma