I want to read json
file. Right now, I am doing the following logic, which is not that dynamic.
df = spark.read.option("multiline", True).json(loc)
df = df.select("data.*", "event.*", "resource_id", "resource_kind", "resource_uri")
I will have to write column.*
multiple times as the file is heavily nested, it has multiple StructType
The schema of the same is as below:
root
|-- data: struct (nullable = true)
| |-- accounts: struct (nullable = true)
| | |-- accounting_reference_date: struct (nullable = true)
| | | |-- day: string (nullable = true)
| | | |-- month: string (nullable = true)
| | |-- last_accounts: struct (nullable = true)
| | | |-- made_up_to: string (nullable = true)
| | | |-- period_end_on: string (nullable = true)
| | | |-- period_start_on: string (nullable = true)
| | | |-- type: string (nullable = true)
| | |-- next_accounts: struct (nullable = true)
| | | |-- due_on: string (nullable = true)
| | | |-- overdue: boolean (nullable = true)
| | | |-- period_end_on: string (nullable = true)
| | | |-- period_start_on: string (nullable = true)
| | |-- next_due: string (nullable = true)
| | |-- next_made_up_to: string (nullable = true)
| | |-- overdue: boolean (nullable = true)
| |-- can_file: boolean (nullable = true)
| |-- company_name: string (nullable = true)
| |-- company_number: string (nullable = true)
| |-- company_status: string (nullable = true)
| |-- confirmation_statement: struct (nullable = true)
| | |-- last_made_up_to: string (nullable = true)
| | |-- next_due: string (nullable = true)
| | |-- next_made_up_to: string (nullable = true)
| | |-- overdue: boolean (nullable = true)
| |-- date_of_creation: string (nullable = true)
| |-- etag: string (nullable = true)
| |-- has_charges: boolean (nullable = true)
| |-- is_community_interest_company: boolean (nullable = true)
| |-- jurisdiction: string (nullable = true)
| |-- last_full_members_list_date: string (nullable = true)
| |-- links: struct (nullable = true)
| | |-- charges: string (nullable = true)
| | |-- filing_history: string (nullable = true)
| | |-- officers: string (nullable = true)
| | |-- persons_with_significant_control: string (nullable = true)
| | |-- persons_with_significant_control_statements: string (nullable = true)
| | |-- registers: string (nullable = true)
| | |-- self: string (nullable = true)
| |-- previous_company_names: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- ceased_on: string (nullable = true)
| | | |-- effective_from: string (nullable = true)
| | | |-- name: string (nullable = true)
| |-- registered_office_address: struct (nullable = true)
| | |-- address_line_1: string (nullable = true)
| | |-- address_line_2: string (nullable = true)
| | |-- country: string (nullable = true)
| | |-- locality: string (nullable = true)
| | |-- po_box: string (nullable = true)
| | |-- postal_code: string (nullable = true)
| | |-- region: string (nullable = true)
| |-- registered_office_is_in_dispute: boolean (nullable = true)
| |-- sic_codes: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- subtype: string (nullable = true)
| |-- type: string (nullable = true)
|-- event: struct (nullable = true)
| |-- published_at: string (nullable = true)
| |-- timepoint: long (nullable = true)
| |-- type: string (nullable = true)
|-- resource_id: string (nullable = true)
|-- resource_kind: string (nullable = true)
|-- resource_uri: string (nullable = true)
As few of the fields are having same names, I need to capture the field name from root.
For eg. field period_start_on
is present in both last_accounts
and next_accounts
.
So, I need to make the column name as below:
data.accounts.last_accounts.period_start_on
data.accounts.next_accounts.period_start_on
I think the approach I am taking wont take me longer. Could you please suggest the effective way of reading the json. Also how can we identify 2 fields having same name.
Thank you