
I have a huge DataFrame with around 7 GB of records. I am trying to get the count of the DataFrame and download it as a CSV, but both operations fail with the error below. Is there any other way of downloading the DataFrame without multiple partitions?

print(df.count())
df.coalesce(1).write.option("header", "true").csv('/user/ABC/Output.csv')
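As an aside on the single-file goal: coalesce(1) funnels the entire write through one task and still creates a directory of part files at the given path, so a common alternative (a sketch only, with illustrative paths and assuming an HDFS-style filesystem) is to write with the default partitioning and merge the parts afterwards:

# Write without coalescing; Spark creates /user/ABC/Output as a
# directory with one part file per partition. The header is left off
# so the merged file does not repeat it once per part.
df.write.option("header", "false").csv('/user/ABC/Output')

# Then, outside Spark, concatenate the parts into a single file:
#   hadoop fs -getmerge /user/ABC/Output /tmp/Output.csv
# (a header line can be prepended to the merged file by hand)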



Error:
java.io.IOException: Stream is corrupted
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
    at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
    at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
20/05/26 18:15:44 ERROR scheduler.TaskSetManager: Task 8 in stage 360.0 failed 1 times; aborting job
[Stage 360:=======>                                                (8 + 1) / 60]
Py4JJavaError: An error occurred while calling o18867.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 360.0 failed 1 times, most recent failure: Lost task 8.0 in stage 360.0 (TID 13986, localhost, executor driver): java.io.IOException: Stream is corrupted
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
    at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
    at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
– Eden T
  • Just wanted to point out that the error happens while reading input file/materializing a dataframe, not when you write it to .csv. – mazaneicha May 27 '20 at 02:30
  • Since you are running the job in local mode, what is your cluster info, like available RAM and the number of nodes? – Shubham Jain May 27 '20 at 05:45
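
Following up on mazaneicha's point that the failure happens while materializing the DataFrame, not while writing the CSV: a workaround often suggested for this LZ4 "Stream is corrupted" error (a sketch only; spark.io.compression.codec is a real Spark setting, but whether it resolves this particular failure is an assumption) is to switch the codec used for shuffle and spill data away from lz4 when building the session:

from pyspark.sql import SparkSession

# Assumption: the corrupted stream is an lz4-compressed shuffle or
# spill block, so using a different codec sidesteps the lz4 read path.
spark = (SparkSession.builder
         .appName("csv-export")
         .config("spark.io.compression.codec", "snappy")
         .getOrCreate())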

0 Answers