I'm new to BigData. I need to convert a .csv/.txt file to Parquet format. I searched a lot but couldn't find any direct way to do so. Is there any way to achieve that?

10 Answers
I already posted an answer on how to do this using Apache Drill. However, if you are familiar with Python, you can now do this using Pandas and PyArrow!
Install dependencies
Using pip:
pip install pandas pyarrow
or using conda:
conda install pandas pyarrow -c conda-forge
Convert CSV to Parquet in chunks
# csv_to_parquet.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/path/to/my.tsv'
parquet_file = '/path/to/my.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize, low_memory=False)

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write the CSV chunk to the Parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
I haven't benchmarked this code against the Apache Drill version, but in my experience it's plenty fast, converting tens of thousands of rows per second (this depends on the CSV file of course!).
Edit: We can now read CSV files directly into PyArrow Tables using pyarrow.csv.read_csv. This is probably faster than using the Pandas CSV reader, although it may be less flexible.
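A minimal sketch of that approach (the file paths are placeholders):

import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the whole CSV into an Arrow Table, then write it out as Parquet
table = pv.read_csv('/path/to/my.csv')
pq.write_table(table, '/path/to/my.parquet')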

- Why is it less flexible? (Sorry, I don't have experience working with `pyarrow`, just got curious seeing your comment) – sphoenix Dec 23 '21 at 13:16
- @sphoenix I was mostly referring to the number of parameters accepted by the [`pd.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and [`pyarrow.csv.read_csv`](https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html) methods. To give a specific example, in the case of `pd.read_csv`, `sep="..."` can be a regular expression, while in the case of `pyarrow.csv.read_csv`, `delimiter="..."` has to be a single character. – ostrokach Dec 23 '21 at 18:40
- There is a risk of schema error in this code, as the schema of the entire CSV is inferred from the first chunk of data. The first chunk may falsely indicate a column as int, but a later chunk may contain decimal or empty values, resulting in an error when writing the Parquet file. The solution would be to infer data types based on a phased reading of the entire CSV before creating the Parquet file schema. The proposed code is in [this answer](https://stackoverflow.com/a/74871381). – the_RR Dec 22 '22 at 13:47
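A simpler way to avoid the pitfall described in the comment above, assuming you know the column types up front (the column names and types below are hypothetical), is to pin the dtypes when reading, so the schema taken from the first chunk holds for every later chunk:

import pandas as pd

# Hypothetical column names and types; replace them with your real ones
dtypes = {'id': 'Int64', 'amount': 'float64', 'comment': 'string'}

csv_stream = pd.read_csv('/path/to/my.tsv', sep='\t', dtype=dtypes,
                         chunksize=100_000, low_memory=False)
# The chunked ParquetWriter loop above stays the same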
[For Python]
Pandas now has direct support for it.
Just read the CSV file into a dataframe with pandas' read_csv, and write that dataframe to a Parquet file using to_parquet.
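A minimal sketch (file paths are placeholders; to_parquet needs pyarrow or fastparquet installed):

import pandas as pd

df = pd.read_csv('/path/to/my.csv')
df.to_parquet('/path/to/my.parquet')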

- Because there was already one without mentioning to_parquet (as it was released with 0.21.0). Thought this might be useful for someone who requires a python based solution. – Pranav Gupta Feb 27 '20 at 17:31
You can use Apache Drill, as described in Convert a CSV File to Apache Parquet With Drill.
In brief:
Start Apache Drill:
$ cd /opt/drill/bin
$ sqlline -u jdbc:drill:zk=local
Create the Parquet file:
-- Set default table format to parquet
ALTER SESSION SET `store.format`='parquet';

-- Create a parquet table containing all data from the CSV table
CREATE TABLE dfs.tmp.`/stats/airport_data/` AS
SELECT
  CAST(SUBSTR(columns[0],1,4) AS INT) `YEAR`,
  CAST(SUBSTR(columns[0],5,2) AS INT) `MONTH`,
  columns[1] as `AIRLINE`,
  columns[2] as `IATA_CODE`,
  columns[3] as `AIRLINE_2`,
  columns[4] as `IATA_CODE_2`,
  columns[5] as `GEO_SUMMARY`,
  columns[6] as `GEO_REGION`,
  columns[7] as `ACTIVITY_CODE`,
  columns[8] as `PRICE_CODE`,
  columns[9] as `TERMINAL`,
  columns[10] as `BOARDING_AREA`,
  CAST(columns[11] AS DOUBLE) as `PASSENGER_COUNT`
FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/*.csv`;
Try selecting data from the new Parquet file:
-- Select data from parquet table
SELECT * FROM dfs.tmp.`/stats/airport_data/*`
You can change the dfs.tmp location by going to http://localhost:8047/storage/dfs (source: CSV and Parquet).

- I confirm this is the best and easiest way to achieve this. Apache Hive could be an alternative too. – Thomas Decaux Oct 06 '16 at 22:04
The following code is an example using Spark 2.0. Reading with an explicit schema is much faster than using the inferSchema option, and Spark 2.0 converts to Parquet much more efficiently than Spark 1.6.
import org.apache.spark.sql.types._

// Define the schema explicitly instead of inferring it
val schema = StructType(Array(
  StructField("timestamp", StringType, true),
  StructField("site", StringType, true),
  StructField("requests", LongType, true)
))

val df = spark.read
  .schema(schema)
  .option("header", "true")
  .option("delimiter", "\t")
  .csv("/user/hduser/wikipedia/pageviews-by-second-tsv")

df.write.parquet("/user/hduser/wikipedia/pageviews-by-second-parquet")

I made a small command line tool to convert CSV to Parquet: csv2parquet

1) You can create an external Hive table:
create external table emp(name string,job_title string,department string,salary_per_year int)
row format delimited
fields terminated by ','
location '.. hdfs location of csv file '
2) Create another Hive table that will store the Parquet file:
create external table emp_par(name string,job_title string,department string,salary_per_year int)
row format delimited
stored as PARQUET
location 'hdfs location where you want to save the parquet file'
3) Insert the data from table one into table two:
insert overwrite table emp_par select * from emp

- Table emp_par has been created as an EXTERNAL TABLE. This should have been created as a normal table, or else you can't insert data into it. – Jai Prakash May 29 '17 at 23:55
Read CSV files as a DataFrame in Apache Spark with the spark-csv package. After loading the data into a DataFrame, save the DataFrame to a Parquet file.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", "DROPMALFORMED")
  .load("/home/myuser/data/log/*.csv")

df.saveAsParquetFile("/home/myuser/data.parquet")
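Note that saveAsParquetFile belongs to the older Spark 1.x DataFrame API; on newer Spark versions the equivalent call would be df.write.parquet("/home/myuser/data.parquet").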

You can use the csv2parquet tool from the https://github.com/fraugster/parquet-go project. It is much simpler to use than Apache Drill.

import pyspark

# Create a local Spark context and SQL context
sc = pyspark.SparkContext('local[*]')
sqlContext = pyspark.sql.SQLContext(sc)

# Read the CSV into a DataFrame and write it back out as Parquet
df = sqlContext.read.csv('file:///xxx/xxx.csv')
df.write.parquet('file:///xxx/output.parquet')

- While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. You can find more information on how to write good answers in the help center: stackoverflow.com/help/how-to-answer. – abhiieor Feb 12 '22 at 17:56
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import sys

sc = SparkContext(appName="CSV2Parquet")
sqlContext = SQLContext(sc)

schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", StringType(), True),
    StructField("col4", StringType(), True),
    StructField("col5", StringType(), True)])

rdd = sc.textFile('/input.csv').map(lambda line: line.split(","))
df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet('/output.parquet')
