0

I am getting Invalid status code '400' errors every time I try to show the PySpark DataFrame. My AWS SageMaker driver and executor memory are 32 GB.

-Env:

Python version: 3.7.6
PySpark version: '2.4.5-amzn-0'
Notebook instance: 'ml.t2.2xlarge'

-EMR cluster config

{"classification":"livy-conf","properties":{"livy.server.session.timeout":"5h"}},
{"classification":"spark-defaults","properties":{"spark.driver.memory":"20G"}}

After some manipulation, I cleaned the data and reduced its size. The DataFrame should be correct:

print(df.count(), len(df.columns))
df.show()
(1642, 9)

 stock     date     time   spread  time_diff    ...
  VOD      01-01    9:05    0.01     1132       ...
  VOD      01-01    9:12    0.03     465        ...
  VOD      01-02   10:04    0.02     245
  VOD      01-02   10:15    0.01     364     
  VOD      01-02   10:04    0.02     12

However, if I continue with filtering,

from pyspark.sql import functions as f

new_df = df.filter(f.col('time_diff') <= 1800)
new_df.show()

then I get this error:

An error was encountered:
Invalid status code '400' from http://11.146.133.8:8990/sessions/34/statements/8 with error payload: {"msg":"requirement failed: Session isn't active."}

I really have no idea what's going on.

Can someone please advise?

Thanks

FlyUFalcon
  • It looks like your session timed out, and there are a lot of reasons that can cause a timeout. Although it's about EMR, this post might help you: https://stackoverflow.com/questions/58062824/session-isnt-active-pyspark-in-an-aws-emr-cluster – Jonathan Lam Aug 09 '22 at 01:25
  • Thanks @Jonathan. I followed those posts as suggested and updated the Livy timeout and driver memory, but the issue still exists. – FlyUFalcon Aug 09 '22 at 09:11
  • Hi @FlyUFalcon, could you share more about: 1. the original size of your `df`; 2. how you save your data (`parquet` or `csv` or ...); 3. how many partitions you have in your df; 4. whether you have any data skewness? As you mentioned, you call some `action` like `count()` and `show()` and it still works at this point but fails after further processing, so I believe it relates to insufficient memory or a single-partition transformation overloading your executor. – Jonathan Lam Aug 09 '22 at 11:29
  • Hi @Jonathan, the dataframe shape is (1642, 9). After I converted it to pandas, the memory usage is 109.2+ KB. Thx. – FlyUFalcon Aug 09 '22 at 12:14
  • Hi @FlyUFalcon, is 109.2+ KB your source data size or the size after transformation? How do you save your source data, and how many partitions do you have when you read the dataset? – Jonathan Lam Aug 12 '22 at 02:02
  • Hi @Jonathan, yes, 109.2+ KB is the data after converting it to pandas. If I run `print(df.rdd.getNumPartitions())`, the output is '1'. Thanks – FlyUFalcon Aug 12 '22 at 08:03

3 Answers

1

I haven't seen this error before, but since you mentioned that you have only 1 partition and the error appears partway through processing rather than at the beginning, I believe it is related to an OOM issue.

Please try to repartition based on the total number of cores you use:

# read the data; let's say you are reading a parquet file and you have 20 cores in total
df = spark.read.parquet("/path/of/your/data")
df = df.repartition(20)

Also, if your DataFrame will be reused, you should use df.persist().
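
For example, a minimal sketch (assuming `df` is read as above and `time_diff` is a numeric column, as in the question) of caching the repartitioned DataFrame before reusing it:

from pyspark import StorageLevel
from pyspark.sql import functions as f

# keep the repartitioned DataFrame in memory, spilling to disk if it doesn't fit
df = df.repartition(20).persist(StorageLevel.MEMORY_AND_DISK)

# both actions now reuse the cached partitions instead of recomputing the lineage
print(df.count())
df.filter(f.col("time_diff") <= 1800).show()

# release the cache once the DataFrame is no longer needed
df.unpersist()

Note that persist() is lazy: the data is only materialised in the cache when the first action runs.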

Jonathan Lam
0

You need to change the `livy.server.session.timeout` parameter. See the answers here or here.
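
For example, a sketch of the EMR classification JSON, mirroring the `livy-conf` block already shown in the question (the 8h value is illustrative; `livy.server.session.timeout-check` turns off Livy's idle-session check entirely, so only add it if that is acceptable for your cluster):

{"classification":"livy-conf","properties":{"livy.server.session.timeout":"8h","livy.server.session.timeout-check":"false"}}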

stahh
0

After days of searching, I finally found the answer to this question. I don't know what was wrong with my config setting, but I needed to update the driver memory in the Spark terminal.

Just increase the memory from there, and the issue will be gone.
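
As an illustration only (a sketch, not necessarily the exact change I made in the Spark terminal; it assumes the notebook talks to EMR through sparkmagic/Livy, the usual SageMaker setup), the per-session driver memory can also be requested from the notebook with the `%%configure` cell magic before the Spark session starts:

%%configure -f
{"driverMemory": "20G", "executorMemory": "16G"}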


FlyUFalcon