16

I'm using AWS S3, Glue, and Athena with the following setup:

S3 --> Glue --> Athena

My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data.

Since I'm using Athena, I'd like to convert the CSV files to Parquet. I'm using AWS Glue to do this right now. This is the current process I'm using:

  1. Run Crawler to read CSV files and populate Data Catalog.
  2. Run ETL job to create Parquet file from Data Catalog.
  3. Run a Crawler to populate Data Catalog using Parquet file.

The Glue job only allows me to convert one table at a time. If I have many CSV files, this process quickly becomes unmanageable. Is there a better way, perhaps a "correct" way, of converting many CSV files to Parquet using AWS Glue or some other AWS service?

John Rotenstein
mark s.

5 Answers

17

I had the exact same situation: I wanted to efficiently loop through the catalog tables created by the crawler, which point to CSV files, and then convert them to Parquet. Unfortunately there is not much information available on the web yet, which is why I wrote a blog post on LinkedIn explaining how I did it. Please have a read, especially point #5. I hope that helps, and please let me know your feedback.

Note: As per Antti's feedback, I am pasting the relevant excerpt from my blog post below:

  1. Iterating through catalog/database/tables

The Job Wizard comes with an option to run a predefined script on a data source. The problem is that the data source you can select is a single table from the catalog; it does not give you the option to run the job on the whole database or on a set of tables. You can modify the script later anyway, but the way to iterate through the database tables in the Glue catalog is also very difficult to find. There are Catalog APIs, but they lack suitable examples. The GitHub example repo could be enriched with many more scenarios to help developers.

After some mucking around, I came up with the script below, which does the job. I used the boto3 client to loop through the tables. I am pasting it here in case it helps someone. I would also like to hear from you if you have a better suggestion.

import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)


# Use the boto3 Glue client to list every table in the source database
client = boto3.client('glue', region_name='ap-southeast-2')

databaseName = 'tpc-ds-csv'
print('\ndatabaseName: ' + databaseName)

Tables = client.get_tables(DatabaseName=databaseName)

tableList = Tables['TableList']

for table in tableList:
    tableName = table['Name']
    print('\n-- tableName: ' + tableName)

    # Read the catalog table into a DynamicFrame
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="tpc-ds-csv",
        table_name=tableName,
        transformation_ctx="datasource0"
    )

    # Write the DynamicFrame back to S3 as Parquet, one prefix per table
    datasink4 = glueContext.write_dynamic_frame.from_options(
        frame=datasource0,
        connection_type="s3",
        connection_options={
            "path": "s3://aws-glue-tpcds-parquet/" + tableName + "/"
        },
        format="parquet",
        transformation_ctx="datasink4"
    )
job.commit()
Ilya Kisil
Tanveer Uddin
  • I solved this issue quite some time ago. The solution mentioned in your linked blog post is almost identical to what I ended up doing. I hope AWS gets around to updating their Glue documentation. It is severely lacking at the moment. – mark s. Sep 19 '18 at 16:06
  • This is not an answer unless you actually provide at least a glimpse of the details in the answer itself. – Antti Haapala -- Слава Україні Dec 18 '18 at 10:18
  • You are right, Antti. I was a new contributor at that time and still learning. I have edited the answer with the actual solution. – Tanveer Uddin Dec 18 '18 at 10:36
4

Please refer to EDIT for updated info.

S3 --> Athena

Why not use the CSV format directly with Athena?

https://docs.aws.amazon.com/athena/latest/ug/supported-format.html

CSV is one of the supported formats. Also, to make querying more efficient, you can compress the CSV files for faster loading.

Supported compression formats:

https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html
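
For illustration, a minimal sketch of creating an Athena table directly over CSV (optionally gzip-compressed) via boto3; the database, bucket, and results-location names below are placeholders:

import boto3

athena = boto3.client('athena', region_name='us-east-1')

# Placeholder names: adjust the database, source bucket and query-results bucket.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.raw_csv (
  id int,
  name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-raw-bucket/csv/'
"""

# Athena reads gzip-compressed CSV transparently as long as the objects end in .gz
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'}
)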

Hope it helps.

EDIT:

Why is the Parquet format more helpful than CSV?

https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and

S3 --> Glue --> Athena

More details on CSV-to-Parquet conversion:

https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/

Kannaiyan
  • I'm using Parquet to improve query performance and lower query costs when using Athena. – mark s. Apr 23 '18 at 19:31
  • Thanks for the insight. Sometimes questions are more informative than answers. – Kannaiyan Apr 23 '18 at 20:49
  • The last link you've provided describes my current process--converting data for a single table to Parquet. I'm looking for a "best practices" or "easy-to-manage" method of doing this for many tables. – mark s. Apr 23 '18 at 21:26
4

I'm not a big fan of Glue, nor of creating schemas from data.

Here's how to do it in Athena, which is dramatically faster than Glue.

This is for the CSV files:

create external table foo (
  id int,
  name string,
  some date
)
row format delimited
  fields terminated by ','
location 's3://mybucket/path/to/csvs/'

This is for the parquet files:

create table bar 
with (
  external_location = 's3://mybucket/path/to/parquet/',
  format = 'PARQUET'
)
as select * from foo 

You don't need to create that Parquet path beforehand, even if you use partitioning.
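
Since the original question is about many tables, one way to apply this per table is to loop over the Glue catalog and submit a CTAS statement for each. A rough sketch via boto3, with placeholder database, bucket, and results-location names:

import boto3

glue = boto3.client('glue')
athena = boto3.client('athena')

# Placeholder names: adjust the catalog database, output bucket and results location.
database = 'csv_db'

# get_tables returns one page of tables; paginate with NextToken for large databases
for table in glue.get_tables(DatabaseName=database)['TableList']:
    name = table['Name']
    ctas = f"""
        CREATE TABLE {name}_parquet
        WITH (
          external_location = 's3://mybucket/path/to/parquet/{name}/',
          format = 'PARQUET'
        ) AS SELECT * FROM {name}
    """
    athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': 's3://mybucket/athena-results/'}
    )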

Neil McGuigan
2

You can convert either JSON or CSV files into Parquet directly, without importing them into the catalog first.

This is for JSON files; the code below converts anything stored under the rawFiles directory.

import sys
from awsglue.job import Job 
from awsglue.transforms import *
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sparkContext = SparkContext()
glueContext = GlueContext(sparkContext)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

s3_json_path = 's3://rawFiles/'  
s3_parquet_path = 's3://convertedFiles/'

output = spark.read.load(s3_json_path, format='json') 
output.write.parquet(s3_parquet_path)

job.commit()
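
For CSV input, only the read call changes. A hedged variant, reusing the spark session and s3_parquet_path from the block above and assuming the files have a header row:

s3_csv_path = 's3://rawFiles/'   # placeholder path, same style as above

# Read the CSVs (with header), let Spark infer the schema, then write Parquet
output = spark.read.load(s3_csv_path, format='csv', header='true', inferSchema='true')
output.write.parquet(s3_parquet_path)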
Udi Yaffe
1

It sounds like in your step 1 you are crawling individual CSV files (e.g. some-bucket/container-path/file.csv). If you instead point your crawler at a path level rather than a file level (e.g. some-bucket/container-path/) and all your CSV files are uniform, the crawler should create a single external table instead of one external table per file, and you'll be able to extract the data from all of the files at once.
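
For reference, a minimal sketch of defining such a path-level crawler with boto3; the crawler name, IAM role, database, and S3 path below are placeholders:

import boto3

glue = boto3.client('glue')

# Placeholder names: the IAM role must already exist and have access to the bucket.
glue.create_crawler(
    Name='csv-folder-crawler',
    Role='AWSGlueServiceRole-Default',
    DatabaseName='csv_db',
    Targets={'S3Targets': [{'Path': 's3://some-bucket/container-path/'}]}
)
glue.start_crawler(Name='csv-folder-crawler')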

kinzleb
  • The Crawler in step 1 is set to crawl a folder path. This creates a database which has multiple tables. The data for each table is stored as a CSV file. I'm trying to convert all of these CSV files to Parquet with a single script or job. In other words, I'd like to convert all CSV files for a given database to Parquet. AWS Glue only lets me choose one table per job. I'm looking for a method to efficiently do this for multiple tables. – mark s. Apr 23 '18 at 19:56
  • @mark (https://stackoverflow.com/users/5504459/mark-s), I am trying to achieve the same thing. Do you have a working solution? – nitinr708 Oct 18 '19 at 10:52
  • @nitinr708 My solution is likely out of date (e.g., `pandas` should now be available with Python shell Glue jobs). The basic approach is to loop over all of your csv files, read each into a dataframe, and then write to parquet. Pandas DFs, Glue DynamicFrames, and PySpark DFs are your options. Each has a different API for reading/writing DFs. These links should be helpful-- Glue: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html. PySpark: https://stackoverflow.com/a/45873742/5504459. Pandas: https://stackoverflow.com/a/37703861/5504459 – mark s. Oct 21 '19 at 19:48
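
For what it's worth, a minimal sketch of the pandas variant of that loop, with placeholder bucket/prefix names and assuming s3fs and pyarrow are installed:

import boto3
import pandas as pd

s3 = boto3.client('s3')
bucket = 'some-bucket'       # placeholder
prefix = 'container-path/'   # placeholder

# List the CSV objects, read each into a DataFrame, and write it back out as Parquet
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', []):
    key = obj['Key']
    if key.endswith('.csv'):
        df = pd.read_csv(f's3://{bucket}/{key}')                    # needs s3fs
        df.to_parquet(f's3://{bucket}/parquet/{key[:-4]}.parquet')  # needs pyarrow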