
Perhaps this is well documented, but I am getting very confused about how to do this (there are many Apache tools).

When I create an SQL table, I use commands like the following:

CREATE TABLE table_name(
   column1 datatype,
   column2 datatype,
   column3 datatype,
   .....
   columnN datatype,
   PRIMARY KEY( one or more columns )
);

How does one convert this existing table into Parquet? Is the resulting file written to disk? If the original data is several GB, how long does one have to wait?

Could I format the original raw data into Parquet format instead?

ShanZhengYang
    For the record, Sqoop has no real future now that Spark is gaining momentum as a "general purpose ETL engine for Hadoop and beyond" *(Sqoop was developed by Cloudera as a stop-gap solution, but Cloudera is now championing Spark...)* – Samson Scharfrichter Jan 09 '17 at 09:11

3 Answers


Apache Spark can be used to do this:

1. load your table from MySQL via JDBC
2. save it as a Parquet file

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# e.g. "jdbc:mysql://host:3306/dbname"
df = spark.read.jdbc("YOUR_MYSQL_JDBC_CONN_STRING", "YOUR_TABLE",
                     properties={"user": "YOUR_USER", "password": "YOUR_PASSWORD"})
df.write.parquet("YOUR_HDFS_FILE")
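
If the table is large, the comments below point out two gotchas: how many parallel queries your MySQL instance can handle, and the JDBC fetch size. A minimal sketch of the same read with those knobs set (the partition column `id`, its bounds, and the partition count are illustrative assumptions, not part of the original answer):

# Partitioned JDBC read: Spark issues numPartitions parallel range queries on "id",
# and a larger fetchsize keeps the driver from buffering the whole table at once.
df = spark.read.jdbc(
    "YOUR_MYSQL_JDBC_CONN_STRING",
    "YOUR_TABLE",
    column="id",            # numeric column to split the read on (assumed to exist)
    lowerBound=1,
    upperBound=10_000_000,  # rough min/max of "id" in the table
    numPartitions=16,       # 16 parallel JDBC queries against MySQL
    properties={"user": "YOUR_USER",
                "password": "YOUR_PASSWORD",
                "fetchsize": "10000"},
)
df.write.parquet("YOUR_HDFS_FILE")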
liprais
  • Excellent! Thanks for this – ShanZhengYang Apr 28 '17 at 02:30
  • How do we check the progress of this save? The command has been stuck for half an hour on my machine. The size of the DB is 44 GB. – Viv Jun 07 '17 at 13:42
  • @Viv you can check the Spark job web page for detailed task info; normally it runs at localhost:8080 – liprais Jun 08 '17 at 05:17
  • My personal experience tells me that to use this solution, you need to be aware of two gotchas: 1. the **number of parallel queries** your `MySQL` DB instance can handle, and 2. the **`JDBC` driver** you use with `Spark` should support a **high `fetchSize`** (like 10k) – y2k-shubham Apr 12 '18 at 06:43
  • The connection works, but when trying to pull the whole table from MySQL, I got: java.sql.SQLException: GC overhead limit exceeded. Did any of you have the same issue? – kennyut Feb 08 '19 at 19:43
  • I get the following error when doing this on my Mac; let me know if you have any suggestions: java.sql.SQLException: No suitable driver – Gadzair May 19 '20 at 13:50

The odbc2parquet command line tool might also be helpful in some situations.

# -vvv                verbose log output, good to know it is still doing something during large downloads
# query               subcommand for accessing data and storing it
# --batch-size        batch size in rows
# --batches-per-file  omit to store the entire query in a single file
# out.par             path to the output Parquet file
odbc2parquet \
  -vvv \
  query \
  --connection-string "${ODBC_CONNECTION_STRING}" \
  --batch-size 100000 \
  --batches-per-file 100 \
  out.par \
  "SELECT * FROM YourTable"
Markus Klein

Another memory-saving option is to stream the SQL query results with SQLAlchemy and append them to the Parquet file incrementally with PyArrow. In SQLAlchemy, set `stream_results=True` via `Connection.execution_options()` and fetch rows in chunks with `Result.yield_per()`. For each chunk, build a `pyarrow.Table` from the rows and append it to the file with `pyarrow.parquet.ParquetWriter.write_table()`, as in the sketch below.
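
A minimal sketch of that approach, assuming a MySQL source reachable via `pymysql` and a simple two-column table; the connection URL, schema, column names, and chunk size are all illustrative, not part of the original answer:

import pyarrow as pa
import pyarrow.parquet as pq
import sqlalchemy as sa

engine = sa.create_engine("mysql+pymysql://user:password@host/dbname")
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])  # match your table's columns

with engine.connect() as conn, pq.ParquetWriter("table_name.parquet", schema) as writer:
    result = conn.execution_options(stream_results=True).execute(
        sa.text("SELECT id, name FROM table_name")
    )
    result.yield_per(10_000)              # server-side cursor, fetch 10k rows at a time
    for rows in result.partitions():      # each chunk is a list of Row objects
        table = pa.Table.from_pylist([dict(r._mapping) for r in rows], schema=schema)
        writer.write_table(table)         # appends one row group to the Parquet file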

Don Nillo