
I am playing around with PySpark using the following code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Scoring System").getOrCreate()

df = spark.read.csv('output.csv')

df.show()

After running python trial.py on the command line, it has been stuck for around 5 to 10 minutes with no progress:

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-05-05 22:58:31 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2019-05-05 22:58:32 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
[Stage 0:>                                                          (0 + 0) / 1]2019-05-05 23:00:08 WARN  YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:00:23 WARN  YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:00:38 WARN  YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:00:53 WARN  YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[Stage 0:>                                                          (0 + 0) / 1]2019-05-05 23:01:08 WARN  YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:01:23 WARN  YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:01:38 WARN  YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

My hunch is that I am lacking resources on my worker node(s), or am I missing something?

Gerard
  • What is the size of the file you are trying to read? – Vitaliy Apr 07 '19 at 18:04
  • Hi @Vitaliy, it's around 21 GB. – Gerard May 06 '19 at 03:21
  • 1
    My guess is that it is trying to infer the schema. Doing this requires loading a large portion of the file to understand the nature of the data. Try specifying the schema explicitly (this is a best practice in general). You can see an example of how to do this at https://stackoverflow.com/a/49281042/180650 – Vitaliy May 14 '19 at 18:52
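
For reference, a minimal sketch of reading the file with an explicit schema, as suggested in the last comment (the column names and types here are assumptions, since the actual CSV layout isn't shown):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("Scoring System").getOrCreate()

# Supplying the schema up front avoids scanning a large part of the 21 GB file just to infer column types
schema = StructType([
    StructField("id", StringType(), True),      # hypothetical column
    StructField("score", DoubleType(), True),   # hypothetical column
])

# header=True assumes the first row holds column names; drop it if the file has no header
df = spark.read.csv('output.csv', schema=schema, header=True)
df.show()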

1 Answer


Try increasing the number of executors and the executor memory, e.g.:

pyspark --num-executors 5 --executor-memory 1G
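
If the script is submitted to YARN rather than run through the pyspark shell, the same settings can be passed to spark-submit (the numbers here are only placeholders; tune them to what the cluster actually has free):

spark-submit --master yarn --num-executors 5 --executor-memory 1g --executor-cores 2 trial.py

Or, roughly equivalently, set them on the session builder inside the script:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Scoring System")
         .config("spark.executor.instances", "5")   # placeholder value
         .config("spark.executor.memory", "1g")     # placeholder value
         .getOrCreate())

Either way, if YARN keeps logging "Initial job has not accepted any resources", check the ResourceManager UI to confirm that workers are registered and that the requested memory and cores fit within what a node can actually allocate.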

Prathik Kini