read textfile in pyspark2

Question

I am trying to read a text file in Spark 2.3 using Python, but I get the error below. This is the format of the text file:

name marks
amar 100
babul 70
ram 98
krish 45

Code:

df=spark.read.option("header","true")\
    .option("delimiter"," ")\
    .option("inferSchema","true")\
    .schema(
        StructType(
            [
                StructField("Name",StringType()),
                StructField("marks",IntegerType())
            ]
        )
    )\
    .text("file:/home/maria_dev/prac.txt")

Error:

java.lang.AssertionError: assertion failed: Text data source only
produces a single data column named "value"

When I read the text file into an RDD instead, each line is collected as a single column.

Should I change the data file, or should I change my code?

notNull · Accepted Answer · 2018-09-18T02:19:58.403


Instead of .text (which produces only a single column named "value"), use .csv to load the file into a DataFrame.

>>> from pyspark.sql.types import *
>>> df=spark.read.option("header","true")\
    .option("delimiter"," ")\
    .option("inferSchema","true")\
    .schema(
        StructType(
            [
                StructField("Name",StringType()),
                StructField("marks",IntegerType())
            ]
        )
    )\
    .csv('file:///home/maria_dev/prac.txt')
>>> df
DataFrame[Name: string, marks: int]
>>> df.show(10,False)
+-----+-----+
|Name |marks|
+-----+-----+
|amar |100  |
|babul|70   |
|ram  |98   |
|krish|45   |
+-----+-----+


Thank you for the info on .csv, but the text file contains no commas, as shown above the code section. It is just a text file with spaces separating column values and newlines separating rows. – abhishek anand Sep 18 '18 at 13:28
@abhishekanand, because the **delimiter option is set to " " (space)** when loading the file with .csv, the df DataFrame is read **with space as the delimiter**, so you get **Name and marks columns** in df. – notNull Sep 18 '18 at 13:34
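
The point about the delimiter can be illustrated without Spark. The sketch below uses Python's standard csv module, not the Spark reader, on the sample data from the question. It only shows that a "CSV" parser splits on whatever delimiter it is given. The function name parse_space_delimited and the SAMPLE constant are made up for this illustration.

```python
import csv
import io

# Sample data copied from the question; in practice it would be read from the file.
SAMPLE = """name marks
amar 100
babul 70
ram 98
krish 45
"""


def parse_space_delimited(text):
    """Parse space-delimited text with a header row into (name, marks) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter=" ")
    next(reader)  # skip the header row ("name marks")
    # Skip blank lines; convert the second field to int, like IntegerType in the schema.
    return [(row[0], int(row[1])) for row in reader if row]


rows = parse_space_delimited(SAMPLE)
print(rows)
```

With the csv module, every single space counts as a separator, so runs of several spaces between values would yield empty fields. The sample file uses exactly one space per separator, so this does not come up here.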
