How does inferSchema for PySpark really work?

Asked May 06 '23 at 15:52

Active May 06 '23 at 15:52

Viewed 62 times

According to the documentation at https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option, inferSchema "Infers the input schema automatically from data". However, my code is not inferring the data types. I have even tried enforceSchema, but nothing worked. In Excel, I notice all the data is of the "General" type; is that the reason behind the hiccup?
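To make the quoted sentence concrete, here is a minimal pure-Python sketch of the general idea behind inferring a type "from data": a column only gets a numeric type if every non-empty value in it parses as that type, otherwise it falls back to string. This is a simplified illustration and not Spark's actual implementation; the function `infer_column_type` and the column names `price`, `mileage`, and `year` are hypothetical.

```python
def infer_column_type(values):
    """Return 'int', 'double', or 'string' for a list of raw CSV field strings."""

    def parses(cast, text):
        # True if the text can be converted with the given Python type.
        try:
            cast(text)
            return True
        except ValueError:
            return False

    # Assumption: empty fields are treated as nulls and do not affect the type.
    non_empty = [v for v in values if v != ""]

    if non_empty and all(parses(int, v) for v in non_empty):
        return "int"
    if non_empty and all(parses(float, v) for v in non_empty):
        return "double"
    # Anything else (including an all-empty column) falls back to string.
    return "string"


# Hypothetical columns, not taken from the dataset in the question.
columns = {
    "price": ["100", "250", "75"],
    "mileage": ["10.5", "", "7"],
    "year": ["2019", "2020", "unknown"],
}

inferred = {name: infer_column_type(vals) for name, vals in columns.items()}
print(inferred)
```

In this sketch a single non-numeric value such as `"unknown"` is enough to make the whole column a string, so looking at a few raw lines of the file can help when a numeric-looking column comes back as string.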

RoiMinuit

Does this answer your question? [Spark Option: inferSchema vs header = true](https://stackoverflow.com/questions/56927329/spark-option-inferschema-vs-header-true) – BeRT2me May 06 '23 at 21:05
@BeRT2me unfortunately, no. I came across that post before I posted my question. I tried removing the header argument, tried 'enforceSchema'...nothing is working. All the data is being imported as string types – RoiMinuit May 07 '23 at 18:04
what does your csv file look like? – ScootCork May 07 '23 at 19:11
@ScootCork it's 9GiB. 65 columns and 3M+ rows – RoiMinuit May 08 '23 at 02:19
Could you provide a sample to reproduce? – ScootCork May 08 '23 at 06:02
@ScootCork I do not know how to do that, but will look it up. However, here is the URL with the data in the meantime: https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset – RoiMinuit May 09 '23 at 01:38
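
One possible way to produce the small reproducible sample requested in the comments is to copy the header plus the first few lines of the large file into a new file, without loading all 9 GiB into memory. The sketch below uses only the Python standard library; the helper `write_sample` and the file names `big.csv` and `sample.csv` are hypothetical, and the demo builds its own stand-in file rather than using the Kaggle dataset. It assumes UTF-8 text, and it cuts on physical lines, so a quoted field containing a line break could be split mid-record.

```python
import os
import tempfile
from itertools import islice


def write_sample(src_path, dst_path, n_lines=6):
    """Copy the first n_lines lines (header included) of src_path to dst_path."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # islice reads lazily, so only n_lines lines are ever read from disk.
        dst.writelines(islice(src, n_lines))


# Demo on a generated stand-in file (hypothetical data).
tmp_dir = tempfile.mkdtemp()
big = os.path.join(tmp_dir, "big.csv")
small = os.path.join(tmp_dir, "sample.csv")

with open(big, "w", encoding="utf-8", newline="") as f:
    f.write("id,price\n")
    for i in range(100):
        f.write(f"{i},{i * 10}\n")

write_sample(big, small, n_lines=4)

with open(small, encoding="utf-8") as f:
    sample_text = f.read()
print(sample_text)
```

The resulting sample file is small enough to paste into a question, together with the exact read call and the schema that came back.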

0 Answers