
I am using PySpark to read a relatively large CSV file (~10 GB):

ddf = spark.read.csv('directory/my_file.csv')

All the columns are read in with the datatype string.

After changing the datatype of, for example, column_a, I can see that it changed to integer. However, if I write the ddf to a Parquet file and read that Parquet file back, all columns have the datatype string again. Question: how can I make sure the Parquet file stores the correct datatypes, so that I do not have to cast them again when reading the Parquet file?
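This is roughly how I change the datatype (a minimal sketch; the withColumn/cast approach is just an illustration, and column_a is the example column from above):

from pyspark.sql import functions as F

# cast the example column from string to integer and assign the result back
ddf = ddf.withColumn('column_a', F.col('column_a').cast('int'))

# the schema now shows column_a as integer
ddf.printSchema()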

Notes:

I write the ddf to a Parquet file as follows:

ddf.repartition(10).write.parquet('directory/my_parquet_file', mode='overwrite')

I use:

  • PySpark version 2.0.0.2
  • Python 3.x
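To illustrate the problem, this is how I read the Parquet file back and check the types (a sketch; same example path and column as above):

ddf2 = spark.read.parquet('directory/my_parquet_file')

# this is where I see every column reported as string again
ddf2.printSchema()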
ptphdev

1 Answer


I read my large files with pandas and do not have this problem. Try using pandas: http://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.read_csv.html

In[1]: import pandas as pd

In[2]: df = pd.read_csv('directory/my_file.csv')
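If some columns still come out with the wrong types, read_csv also accepts an explicit dtype mapping (a sketch; column_a is just the example column from the question):

In[3]: df = pd.read_csv('directory/my_file.csv', dtype={'column_a': 'int64'})

In[4]: df.dtypes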
Lucas Fagundes
  • I have already used pandas to read the file. However, due to the size of the data file and the calculations I have to do, I want to use PySpark for this. Is there a way to read/write Parquet files and 'save' the dtypes into the Parquet file? – ptphdev Jun 04 '18 at 09:35
  • Try reading the huge CSV in chunks, as described here: https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file – Lucas Fagundes Jun 05 '18 at 11:36