
I have to get the schema from a CSV file (the column names and datatypes). This is how far I have gotten -

from pyspark.sql import Row

l = [('Alice', 1)]
Person = Row('name', 'age')
rdd = sc.parallelize(l)  # sc is the SparkContext (e.g. spark.sparkContext)
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
print(df2.schema)
#StructType(List(StructField(name,StringType,true),StructField(age,LongType,true)))

I want to extract the values name and age along with StringType and LongType, but I don't see any method on StructType for that.

There is a toDDL method on StructType in Scala, but the same is not available in Python.

This is an extension of the question below, where I already got help, but I wanted to create a new thread - Get dataframe schema load to metadata table

Thanks for the reply, I am updating with the full code -

import pyspark  # only run after findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .getOrCreate()
sc = spark.sparkContext

l = [('Alice', 1)]
Person = Row('name', 'age')
rdd = sc.parallelize(l)
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)

# dtypes returns a list of (column name, type string) tuples
df3 = df2.dtypes
df1 = spark.createDataFrame(df3, ['colname', 'datatype'])
df1.show()
df1.createOrReplaceTempView("test")
spark.sql('''select * from test''').show()

Output

+-------+--------+
|colname|datatype|
+-------+--------+
|   name|  string|
|    age|  bigint|
+-------+--------+

+-------+--------+
|colname|datatype|
+-------+--------+
|   name|  string|
|    age|  bigint|
+-------+--------+

1 Answer


IIUC, you can loop over the values in df2.schema.fields and get the name and dataType:

print([(x.name, x.dataType) for x in df2.schema.fields])
#[('name', StringType), ('age', LongType)]
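
Each StructField also carries nullability, and every dataType has a simpleString() method if you prefer plain type names over DataType objects; a minimal sketch collecting all three for a metadata table:

print([(f.name, f.dataType.simpleString(), f.nullable) for f in df2.schema.fields])
#[('name', 'string', True), ('age', 'bigint', True)]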

There is also dtypes:

print(df2.dtypes)
#[('name', 'string'), ('age', 'bigint')]
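
Since dtypes already returns (name, type) pairs, it can feed straight into a DataFrame and then be persisted, as in your own code; a minimal sketch, where schema_metadata is just a placeholder table name:

meta = spark.createDataFrame(df2.dtypes, ['colname', 'datatype'])
meta.write.mode('overwrite').saveAsTable('schema_metadata')  # placeholder table name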

and you may also be interested in printSchema():

df2.printSchema()
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
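
printSchema() only prints to stdout; if you need that tree as a string, the underlying Java schema object exposes it (this goes through the internal _jdf handle, so it may change between Spark versions):

tree = df2._jdf.schema().treeString()  # internal API; same text printSchema() prints
print(tree)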
  • What about functions like .toDDL() from Scala in PySpark? – InLaw Apr 29 '20 at 15:38
  • you could try something like: ddl = spark.sparkContext._jvm.org.apache.spark.sql.types.DataType.fromJson(schema_json).toDDL() – Boris Dec 14 '20 at 16:10
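
Expanding Boris's comment into a runnable sketch: rebuild the Scala StructType from its JSON form and call toDDL() through the JVM gateway (this relies on Spark internals via _jvm, so treat it as a version-dependent workaround rather than a public API):

schema_json = df2.schema.json()
jvm = spark.sparkContext._jvm
ddl = jvm.org.apache.spark.sql.types.DataType.fromJson(schema_json).toDDL()
print(ddl)
#`name` STRING,`age` BIGINT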