
Perhaps this is well documented, but I am getting very confused about how to do this (there are many Apache tools).

When I create an SQL table, I use commands like the following:

CREATE TABLE table_name(
   column1 datatype,
   column2 datatype,
   column3 datatype,
   .....
   columnN datatype,
   PRIMARY KEY( one or more columns )
);

How does one convert this existing table into Parquet? Is the resulting file written to disk? If the original data is several GB, how long does one have to wait?

Could I format the original raw data into Parquet format instead?

ShanZhengYang
    For the record, Sqoop has no real future now that Spark is gaining momentum as a "general purpose ETL engine for Hadoop and beyond" *(Sqoop was developed by Cloudera as a stop-gap solution, but Cloudera is now championing Spark...)* – Samson Scharfrichter Jan 09 '17 at 09:11

3 Answers


Apache Spark can be used to do this:

1. load your table from MySQL via JDBC
2. save it as a Parquet file

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# e.g. "jdbc:mysql://host:3306/dbname"
df = spark.read.jdbc("YOUR_MYSQL_JDBC_CONN_STRING", "YOUR_TABLE",
                     properties={"user": "YOUR_USER", "password": "YOUR_PASSWORD"})
df.write.parquet("YOUR_HDFS_FILE")
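
If the table is large, the comments below point out two gotchas: how many parallel queries your MySQL instance can handle, and the JDBC fetch size. A minimal sketch of the same read with those knobs set (the partition column `id`, its bounds, and the partition count are illustrative assumptions, not part of the original answer):

# Partitioned JDBC read: Spark issues numPartitions parallel range queries on "id",
# and a larger fetchsize keeps the driver from buffering the whole table at once.
df = spark.read.jdbc(
    "YOUR_MYSQL_JDBC_CONN_STRING",
    "YOUR_TABLE",
    column="id",            # numeric column to split the read on (assumed to exist)
    lowerBound=1,
    upperBound=10_000_000,  # rough min/max of "id" in the table
    numPartitions=16,       # 16 parallel JDBC queries against MySQL
    properties={"user": "YOUR_USER",
                "password": "YOUR_PASSWORD",
                "fetchsize": "10000"},
)
df.write.parquet("YOUR_HDFS_FILE")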
liprais
  • Excellent! Thanks for this – ShanZhengYang Apr 28 '17 at 02:30
  • How do we check the progress of this save? The command has been stuck for half an hour on my machine. The size of the DB is 44 GB. – Viv Jun 07 '17 at 13:42
  • @Viv you can check the Spark job web page for detailed task info; normally it runs at localhost:8080 – liprais Jun 08 '17 at 05:17
  • My personal experience tells me that to use this solution, you need to be aware of two gotchas: 1. the **number of parallel queries** your `MySQL` DB instance can handle, and 2. the **`JDBC` driver** you use with `Spark` should support a **high `fetchSize`** (like 10k) – y2k-shubham Apr 12 '18 at 06:43
  • The connection works, but when trying to pull the whole table from MySQL, I got: java.sql.SQLException: GC overhead limit exceeded. Did any of you have the same issue? – kennyut Feb 08 '19 at 19:43
  • I get the following error when doing this on my Mac; let me know if you have any suggestions: java.sql.SQLException: No suitable driver – Gadzair May 19 '20 at 13:50

The odbc2parquet command line tool might also be helpful in some situations.

# -vvv                verbose log output, good to know it is still doing something during large downloads
# query               subcommand for accessing data and storing it
# --batch-size        batch size in rows
# --batches-per-file  omit to store the entire query in a single file
# out.par             path to the output Parquet file
odbc2parquet \
  -vvv \
  query \
  --connection-string "${ODBC_CONNECTION_STRING}" \
  --batch-size 100000 \
  --batches-per-file 100 \
  out.par \
  "SELECT * FROM YourTable"
Markus Klein

Another memory-saving option is to stream the SQL query results with SQLAlchemy and append them to the Parquet file incrementally with PyArrow. In SQLAlchemy, set `stream_results=True` via `Connection.execution_options()` and fetch rows in chunks with `Result.yield_per()`. For each chunk, build a `pyarrow.Table` from the rows and append it to the file with `pyarrow.parquet.ParquetWriter.write_table()`, as in the sketch below.
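
A minimal sketch of that approach, assuming a MySQL source reachable via `pymysql` and a simple two-column table; the connection URL, schema, column names, and chunk size are all illustrative, not part of the original answer:

import pyarrow as pa
import pyarrow.parquet as pq
import sqlalchemy as sa

engine = sa.create_engine("mysql+pymysql://user:password@host/dbname")
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])  # match your table's columns

with engine.connect() as conn, pq.ParquetWriter("table_name.parquet", schema) as writer:
    result = conn.execution_options(stream_results=True).execute(
        sa.text("SELECT id, name FROM table_name")
    )
    result.yield_per(10_000)              # server-side cursor, fetch 10k rows at a time
    for rows in result.partitions():      # each chunk is a list of Row objects
        table = pa.Table.from_pylist([dict(r._mapping) for r in rows], schema=schema)
        writer.write_table(table)         # appends one row group to the Parquet file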

Don Nillo