PySpark's read.csv is messing up the format of CSV file

Question

I downloaded the data from Kaggle. Here is the link: https://www.kaggle.com/datasets/utkarshx27/motor-vehicle-collisions

I am using the following command to read the CSV:

data = spark.read.csv('Data/Motor_Vehicle_Collisions_-_Crashes.csv', inferSchema=True, header=True)

I am getting the following output:

Pyspark corrupted data

Please help me resolve this issue.

The following is the expected output, which I got from the pandas read_csv command: Expected format

What command are you running to print the Spark DataFrame? – Grimlock, May 09 '23 at 13:49

score 0 · Answer 1 · answered May 09 '23 at 14:10

It looks like your terminal window is too narrow; the data isn't corrupted at all. Your terminal wraps long lines, so each wide row of the table spills over several lines and looks garbled. Here's how it looks for me (with data.show()):

EDIT: your problem is also described here:

pyspark show dataframe as table with horizontal scroll in ipython notebook
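
If widening the terminal isn't an option, you can also make the output itself narrower. Here is a minimal sketch of two ways to do that. The helper names show_readable and show_in_chunks are my own and not part of PySpark; they only rely on DataFrame.show (with its n, truncate and vertical arguments), DataFrame.columns and DataFrame.select. I'm assuming your column names contain no dots, since those would need backtick quoting in select.

```python
def show_readable(df, n=5, max_col_width=30):
    """Print the first n rows with one field per line, so wide rows cannot wrap.

    df is expected to be a pyspark.sql.DataFrame (hypothetical helper, not a PySpark API).
    """
    # vertical=True prints each row as a block of "column | value" lines;
    # an integer truncate cuts long strings to that many characters.
    df.show(n=n, truncate=max_col_width, vertical=True)


def show_in_chunks(df, n=5, cols_per_chunk=6):
    """Print the first n rows a few columns at a time, so each table fits the terminal."""
    columns = df.columns
    for start in range(0, len(columns), cols_per_chunk):
        chunk = columns[start:start + cols_per_chunk]
        # Each call prints a regular, narrow table for this group of columns.
        df.select(*chunk).show(n=n, truncate=True)
```

With the DataFrame from the question you would call show_readable(data) or show_in_chunks(data, cols_per_chunk=5). Either way the data is the same as before, only the way it is printed changes.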
