When I run a simple PySpark program in PyCharm Community Edition that just prints the elements of a list, I get the error below, but only when the lambda inside flatMap calls split; the plain collect over the original RDD works fine, and the failure happens once the flatMap stage runs and the executor tries to spawn a "python3" worker process. Could you please suggest a solution? The flatMap call splits each element on a semicolon:
b = a.flatMap(lambda line: line.split(";"))
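For context, here is a plain-Python sketch (no Spark involved) of what I expect that flatMap to produce, so the intent is clear:

data = ["a;1", "b;2", "c;3", "d;4", "e;5"]
# split each element on ";" and flatten the pieces into one list,
# which is what flatMap should do across the RDD
expected = [token for line in data for token in line.split(";")]
print(expected)  # ['a', '1', 'b', '2', 'c', '3', 'd', '4', 'e', '5']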
I am using PyCharm Community Edition 2021.2.2, Spark 3.4.1, and Hadoop 3.3.6 on Windows.
C:\Users\jagadeeswaran\AppData\Local\Programs\Python\Python311\python.exe C:/Users/jagadeeswaran/PycharmProjects/spark1/spark2.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/29 22:40:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
a;1
b;2
c;3
d;4
e;5
23/08/29 22:40:19 ERROR Executor: Exception in task 4.0 in stage 1.0 (TID 12)
java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1140)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1074)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1623)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
	at java.base/java.lang.ProcessImpl.create(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:500)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:159)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1111)
	... 17 more
23/08/29 22:40:19 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 14)
java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1140)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1074)
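My suspicion is that the executor fails because a standard Windows Python install ships python.exe rather than python3.exe, so the "python3" command Spark tries to launch for its worker does not exist on my PATH. A quick check along these lines (illustrative only, not part of my script) prints None when python3 is not resolvable:

import shutil

# shutil.which returns the full path of a command if it is on PATH, else None
print(shutil.which("python3"))  # None on my machine, I believe
print(shutil.which("python"))   # python.exe, if it is on PATH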
Below is the code snippet I tried to run:
import pyspark
# sql functions import (not actually used below)
from pyspark.sql.functions import split, col
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
sc1 = SparkSession.builder.appName('flatmap() pyspark').getOrCreate()

data = ["a;1", "b;2", "c;3", "d;4", "e;5"]
a = sc1.sparkContext.parallelize(data)

# this collect works: it prints the five raw elements
for element in a.collect():
    print(element)

# this is the step that fails: split each element on ";" and flatten
b = a.flatMap(lambda line: line.split(";"))
for element in b.collect():
    print(element)
I installed the pyspark package via the Project Interpreter settings in PyCharm, and I checked my environment variables too; those look fine to me. I originally hit this while trying to run a word-count program, which failed the same way.
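One thing I am wondering: since there is apparently no python3.exe on my machine, do I need to point Spark explicitly at my interpreter? For example, something like the following at the top of the script, before any SparkContext is created (sys.executable resolves to my python.exe; I have not confirmed this is the right fix):

import os
import sys

# point both the driver and the worker processes at this interpreter
# instead of letting Spark look for a "python3" command on PATH
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable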
Thank you!