Questions tagged [pyspark-schema]

68 questions
3 votes, 2 answers

Update a specific value when 2 other values match in 2 different tables in PySpark

Any idea how to write this in PySpark? I have two PySpark DataFrames that I'm trying to union. However, there is one value that I want to update based on two duplicate column values. PyDf1: +-----------+-----------+-----------+------------+ |test_date …
Mick
3 votes, 1 answer

How to create dataframe with struct column in PySpark without specifying a schema?

I am learning PySpark and it is convenient to be able to quickly create example dataframes to try the functionality of the PySpark API. The following code (where spark is a spark session): import pyspark.sql.types as T df = [{'id': 1, 'data': {'x':…
karpan
3 votes, 1 answer

How to change a column type in an array of structs with PySpark

How can I change a column type inside an array of structs with PySpark? For example, I would like to change userid from int to long: root |-- id: string (nullable = true) |-- numbers: array (nullable = true) | |-- element: struct (containsNull = true) …
Frank
2 votes, 1 answer

Is there any way to convert a flattened DataFrame to a nested DataFrame using PySpark?

I have the following dataframe with the schema: +------+--------+--------+----------+----------+-------+----------+------+--------------+-------+ |emp_id|emp_name|job_name|manager_id| hire_date|…
D Das
2 votes, 2 answers

PySpark read JSON with custom nested schema doesn't apply

I have this simple JSON file: {"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"right":false,"left":false}}}} But when I'm trying to read it like…
Valéry
2 votes, 0 answers

PySpark Lag function based on condition

I am new to PySpark and have been trying a few things. I have a data frame as follows: +----------+-----------+ | Column1| Column2| +----------+-----------+ | VALUE1| 30000| | VALUE2| 25000| | VALUE3| 20000| | VALUE4| …
SamaAdi
2 votes, 2 answers

Update a highly nested column from string to struct

|-- x: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- y: long (nullable = true) | | |-- z: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |--…
2 votes, 2 answers

Specifying column with multiple datatypes in Spark Schema

I am trying to create a schema to parse JSON into a Spark DataFrame. I have a column, value, in the JSON which could be either a struct or a string: "value": { "entity-type": "item", "id": "someid", "numeric-id": 30 } "value": "SomePicture.jpg", How…
1 vote, 2 answers

Selecting a column with backtick in its name - AnalysisException: cannot resolve Column

I have a data frame which has the below column: Last Login- Date & Time(Incl. Time Zone) When I read the data and print the schema, the column gets printed df.printSchema() But when I try selecting the column from the data frame it…
Jim Macaulay
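For the column-name question above, a common cause of `AnalysisException: cannot resolve` is that Spark parses a dot in an unquoted column name as struct field access. A minimal sketch of one possible workaround follows. It assumes the column name is exactly the one visible in the excerpt; `quote_column` is a hypothetical helper, not a Spark API. Doubling an embedded backtick is the Spark SQL convention for quoted identifiers, but whether `select` accepts it may depend on the Spark version.

```python
def quote_column(name: str) -> str:
    """Wrap a column name in backticks so dots and other special
    characters are read literally; embedded backticks are doubled."""
    return "`" + name.replace("`", "``") + "`"


# Column name copied from the excerpt above.
quoted = quote_column("Last Login- Date & Time(Incl. Time Zone)")

# With a DataFrame `df` that has this column, one would then try:
#     df.select(quoted)
```

Renaming the column once with `withColumnRenamed` to a name without special characters is another option when the original name is not needed downstream.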
1 vote, 1 answer

How to replace null value with some value using coalesce in pyspark

I have two files: orders_renamed.csv and customers.csv. I am joining them with a full outer join and then dropping the shared column (customer_id). I want to replace null values with "-1" in the "order_id" column. I have tried this: from pyspark.sql.functions import…
1 vote, 1 answer

How to define a schema for a semi-structured text file in PySpark

1 2013-07-25 11599,CLOSED 2 2013-07-25 256,PENDING_PAYMENT 3 2013-07-25 12111,COMPLETE 4 2013-07-25 8827,CLOSED 5 2013-07-25 11318,COMPLETE 6 2013-07-25 7130,COMPLETE 7 2013-07-25 4530,COMPLETE 8 2013-07-25 2911,PROCESSING 9…
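For the semi-structured file above, one possible approach is to split each record with a regular expression before applying a schema. The sketch below is plain Python. It assumes one record per line, whitespace between the first three fields, and a comma before the status, as the excerpt suggests. The field names (`order_id`, `order_date`, `customer_id`, `status`) are hypothetical, since the question does not name them.

```python
import re
from datetime import date
from typing import NamedTuple


class Order(NamedTuple):
    # Hypothetical field names; the excerpt shows only raw values.
    order_id: int
    order_date: date
    customer_id: int
    status: str


# Three whitespace-separated fields, then a comma and the status.
LINE_RE = re.compile(r"^(\d+)\s+(\d{4}-\d{2}-\d{2})\s+(\d+),(\w+)$")


def parse_line(line: str) -> Order:
    """Parse one record such as '1 2013-07-25 11599,CLOSED'."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised record: {line!r}")
    order_id, day, customer_id, status = match.groups()
    return Order(int(order_id), date.fromisoformat(day), int(customer_id), status)
```

In a Spark job, a function like this could be mapped over the lines read with `spark.sparkContext.textFile(...)` and the result passed to `spark.createDataFrame` together with an explicit `StructType`.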
1 vote, 1 answer

PySpark nested JSON with dynamic column names into one column

Could you help me? I need from this JSONL data: {"id": 1, "data": {"key:1": {"string_value": "value_1"}, "key:2": {"string_value": "value_2"}, "user_id": {"string_value": "value_4"}}} {"id": 2, "data": {"key:3": {"string_value": "value_3"},…
zigi
1 vote, 1 answer

Getting nulls while selecting a dataframe from a JSON file in PySpark

I am using Spark 3.1 and trying to read a JSON file. I have defined the schema for the file below as: StructType([ StructField('search_metadata', MapType(StringType(),StringType())), StructField('search_parameters',…
Xi12
1 vote, 1 answer

DataFrames being read in with a varying number of columns: how do I dynamically change the data types of only the Boolean columns to String?

In my notebook, I have DataFrames being read in that will have a variable number of columns every time the notebook is run. How do I dynamically change the data types of only the columns that are Boolean data types to String data type? This is a…
JTD2021
1 vote, 0 answers

A schema mismatch detected when writing to the Delta table Data stream write

I have .option("mergeSchema", "true") in my code, but I am still getting a schema mismatch error. I am reading the schema for Parquet; my timestamp was in bigint format, so I converted it to timestamp format and then created a new column, date, which I want to…