
I have to get the schema from a CSV file (the column names and datatypes). This is how far I have gotten -

from pyspark.sql import Row

l = [('Alice', 1)]
Person = Row('name', 'age')
rdd = sc.parallelize(l)  # sc is the SparkContext (e.g. spark.sparkContext)
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
print(df2.schema)
#StructType(List(StructField(name,StringType,true),StructField(age,LongType,true)))

I want to extract the values name and age along with StringType and LongType, but I don't see any method on StructType for that.

There is a toDDL method on StructType in Scala, but the same is not available in Python.

This is an extension of the question below, where I already got help, but I wanted to create a new thread - Get dataframe schema load to metadata table

Thanks for the reply, I am updating with the full code -

import pyspark  # only run after findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .getOrCreate()
sc = spark.sparkContext

l = [('Alice', 1)]
Person = Row('name', 'age')
rdd = sc.parallelize(l)
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)

# dtypes returns a list of (column name, type string) tuples
df3 = df2.dtypes
df1 = spark.createDataFrame(df3, ['colname', 'datatype'])
df1.show()
df1.createOrReplaceTempView("test")
spark.sql('''select * from test''').show()

Output

+-------+--------+
|colname|datatype|
+-------+--------+
|   name|  string|
|    age|  bigint|
+-------+--------+

+-------+--------+
|colname|datatype|
+-------+--------+
|   name|  string|
|    age|  bigint|
+-------+--------+

1 Answer


IIUC, you can loop over the values in df2.schema.fields and get the name and dataType:

print([(x.name, x.dataType) for x in df2.schema.fields])
#[('name', StringType), ('age', LongType)]
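
Each StructField also carries nullability, and every dataType has a simpleString() method if you prefer plain type names over DataType objects; a minimal sketch collecting all three for a metadata table:

print([(f.name, f.dataType.simpleString(), f.nullable) for f in df2.schema.fields])
#[('name', 'string', True), ('age', 'bigint', True)]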

There is also dtypes:

print(df2.dtypes)
#[('name', 'string'), ('age', 'bigint')]
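
Since dtypes already returns (name, type) pairs, it can feed straight into a DataFrame and then be persisted, as in your own code; a minimal sketch, where schema_metadata is just a placeholder table name:

meta = spark.createDataFrame(df2.dtypes, ['colname', 'datatype'])
meta.write.mode('overwrite').saveAsTable('schema_metadata')  # placeholder table name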

and you may also be interested in printSchema():

df2.printSchema()
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
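
printSchema() only prints to stdout; if you need that tree as a string, the underlying Java schema object exposes it (this goes through the internal _jdf handle, so it may change between Spark versions):

tree = df2._jdf.schema().treeString()  # internal API; same text printSchema() prints
print(tree)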
  • What about functions like .toDDL() from Scala in PySpark? – InLaw Apr 29 '20 at 15:38
  • you could try something like: ddl = spark.sparkContext._jvm.org.apache.spark.sql.types.DataType.fromJson(schema_json).toDDL() – Boris Dec 14 '20 at 16:10
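
Expanding Boris's comment into a runnable sketch: rebuild the Scala StructType from its JSON form and call toDDL() through the JVM gateway (this relies on Spark internals via _jvm, so treat it as a version-dependent workaround rather than a public API):

schema_json = df2.schema.json()
jvm = spark.sparkContext._jvm
ddl = jvm.org.apache.spark.sql.types.DataType.fromJson(schema_json).toDDL()
print(ddl)
#`name` STRING,`age` BIGINT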