I have huge dataframe around 7GB records. I am trying to get the count of the dataframe and download it as csv Both of them result in below error. is there any other way of downloading the dataframe without multiple partitions
print(df.count())
df.coalesce(1).write.option("header", "true").csv('/user/ABC/Output.csv')
Error:
java.io.IOException: Stream is corrupted
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/05/26 18:15:44 ERROR scheduler.TaskSetManager: Task 8 in stage 360.0 failed 1 times; aborting job
[Stage 360:=======> (8 + 1) / 60]
Py4JJavaError: An error occurred while calling o18867.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 360.0 failed 1 times, most recent failure: Lost task 8.0 in stage 360.0 (TID 13986, localhost, executor driver): java.io.IOException: Stream is corrupted
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)