Questions tagged [pyspark-schema]
68 questions
3
votes
2 answers
Update a specific value when 2 other values match from 2 different tables in PySpark
Any idea how to write this in PySpark?
I have two PySpark DataFrames that I'm trying to union. However, there is one value that I want to update based on two duplicate column values.
PyDf1:
+-----------+-----------+-----------+------------+
|test_date …

Mick
- 265
- 2
- 10
3
votes
1 answer
How to create dataframe with struct column in PySpark without specifying a schema?
I am learning PySpark and it is convenient to be able to quickly create example dataframes to try the functionality of the PySpark API.
The following code (where spark is a spark session):
import pyspark.sql.types as T
df = [{'id': 1, 'data': {'x':…

karpan
- 421
- 1
- 5
- 13
3
votes
1 answer
How to change a column type in an array of structs with PySpark
How can I change a column type inside an array of structs with PySpark? For example, I would like to change userid from int to long.
root
|-- id: string (nullable = true)
|-- numbers: array (nullable = true)
| |-- element: struct (containsNull = true)
…

Frank
- 977
- 3
- 14
- 35
2
votes
1 answer
Is there any way to convert a flattened DataFrame to a nested DataFrame using PySpark?
I have the following dataframe with the schema:
+------+--------+--------+----------+----------+-------+----------+------+--------------+-------+
|emp_id|emp_name|job_name|manager_id| hire_date|…

D Das
- 31
- 1
2
votes
2 answers
PySpark read JSON with custom nested schema doesn't apply
I have this simple JSON file:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"right":false,"left":false}}}}
But when I'm trying to read it like…

Valéry
- 31
- 5
2
votes
0 answers
PySpark Lag function based on condition
I am new to PySpark and have been trying a few things.
I have a data frame as follows:
+----------+-----------+
| Column1| Column2|
+----------+-----------+
| VALUE1| 30000|
| VALUE2| 25000|
| VALUE3| 20000|
| VALUE4| …

SamaAdi
- 41
- 1
- 6
2
votes
2 answers
Update a highly nested column from string to struct
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- y: long (nullable = true)
| | |-- z: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |--…

Chirag Sejpal
- 877
- 2
- 9
- 17
2
votes
2 answers
Specifying column with multiple datatypes in Spark Schema
I am trying to create a schema to parse JSON into a Spark DataFrame.
I have a column value in the JSON which could be either a struct or a string:
"value": {
"entity-type": "item",
"id": "someid",
"numeric-id": 30
}
"value": "SomePicture.jpg",
How…

Neha Zaveri
- 21
- 5
1
vote
2 answers
Selecting a column with backtick in its name - AnalysisException: cannot resolve Column
I have a data frame which has the below column:
Last Login- Date & Time(Incl. Time Zone)
When I read the data and print the schema, the column gets printed
df.printSchema()
But when I try selecting the column from the data frame it…

Jim Macaulay
- 4,709
- 4
- 28
- 53
1
vote
1 answer
How to replace null value with some value using coalesce in pyspark
I have two files: orders_renamed.csv and customers.csv.
I am joining them with a full outer join and then dropping the duplicate column (customer_id).
I want to replace null values with "-1" in the "order_id" column.
I have tried this:
from pyspark.sql.functions import…

Vivek Mishra
- 23
- 3
1
vote
1 answer
How to define a schema for a semi-structured text file in PySpark
1 2013-07-25 11599,CLOSED
2 2013-07-25 256,PENDING_PAYMENT
3 2013-07-25 12111,COMPLETE
4 2013-07-25 8827,CLOSED
5 2013-07-25 11318,COMPLETE
6 2013-07-25 7130,COMPLETE
7 2013-07-25 4530,COMPLETE
8 2013-07-25 2911,PROCESSING
9…

Vivek Mishra
- 23
- 3
1
vote
1 answer
Pyspark nested json with dynamical column names into one column
Could you help me? I need from this JSONL data:
{"id": 1, "data": {"key:1": {"string_value": "value_1"}, "key:2": {"string_value": "value_2"}, "user_id": {"string_value": "value_4"}}}
{"id": 2, "data": {"key:3": {"string_value": "value_3"},…

zigi
- 21
- 2
1
vote
1 answer
Getting nulls while selecting a dataframe from a JSON file in PySpark
I am using Spark 3.1 and trying to read a JSON file.
I have defined the schema for the file below as:
StructType([
StructField('search_metadata', MapType(StringType(),StringType())),
StructField('search_parameters',…

Xi12
- 939
- 2
- 14
- 27
1
vote
1 answer
Data Frames being read in with varying number of columns, how do I dynamically change data types of only columns that are Boolean to String data type?
In my notebook, I have DataFrames being read in that will have a variable number of columns every time the notebook is run. How do I dynamically change the data types of only the Boolean columns to the String data type?
This is a…

JTD2021
- 127
- 2
- 12
1
vote
0 answers
A schema mismatch detected when writing to the Delta table Data stream write
I have .option("mergeSchema", "true") in my code, but I am still getting a schema mismatch error. I am reading the schema from Parquet; my timestamp was in bigint format, so I converted it to timestamp format and then created a new column date which I want to…

Manav Jain
- 21
- 2