
I am trying to understand if there is any difference in the following approaches, in terms of memory usage, optimisation, parallelism etc.

Scenario: CSV files in an S3 bucket. 100 columns, more than 200 million rows in total

Read Option 1:

val df = spark.read.csv("s3://some-location")

Read Option 2: First create an external table in Hive using the S3 location and then read it in Spark

# Hive - Create External table
# Spark - val df = spark.sql("select * from external_table")
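
For concreteness, roughly what I mean, with made-up column names (the real data has 100 columns) and assuming a SparkSession with Hive support enabled:

// Hive side (or spark.sql on a Hive-enabled session); column names are illustrative
spark.sql("""
  CREATE EXTERNAL TABLE external_table (
    id BIGINT,
    event_date STRING,
    amount DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 's3://some-location'
""")

// Spark side
val df = spark.sql("select * from external_table")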

I would like to know the optimised way to read in such a scenario and what the considerations are.

Is one approach better than the other when further transformations and actions are applied downstream?

Thanks!

Shiv Konar

2 Answers


Both will suffer, as CSV is such an inefficient store of data: every query has to scan the entire file, and, because numbers and dates are stored as text, the files are far bigger than binary formats.

Do not use CSV as a data format except for a one-off ETL operation. Prefer Avro as an exchange format, and use Parquet for the tables you actually run queries against.

For that single ETL stage, don't use schema inference, as Spark will have to read the data twice: once to determine the schema and again to process it.
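
For example, a minimal sketch of supplying the schema up front instead of inferring it (column names are made up; the real files have 100 columns):

import org.apache.spark.sql.types._

// Hypothetical schema with illustrative names and types
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("event_date", DateType),
  StructField("amount", DoubleType)
))

val df = spark.read
  .option("header", "true")
  .schema(schema)          // no inference pass over the data
  .csv("s3://some-location")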

200M rows with 100 columns isn't that big, so you could download the data locally and experiment with Spark standalone until you have it exporting to an Avro/Parquet format you like, then apply the job to the entire directory, after which you can create a new table over the output.
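
Roughly what that one-off conversion could look like, assuming df was read with an explicit schema as sketched above, and using a made-up output path and table name:

// Write the CSV DataFrame out once as Parquet
df.write
  .mode("overwrite")
  .parquet("s3://some-bucket/converted/")

// Put a table over the converted data so later queries hit Parquet, not CSV
spark.sql("""
  CREATE TABLE converted_table
  USING parquet
  LOCATION 's3://some-bucket/converted/'
""")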

stevel
  • Thanks @stevel, understood. What if it's the same files but in Parquet? How would the performance be? – Shiv Konar May 05 '23 at 12:05
  • Better, as there is much less data to load, and if the S3 connector does its bulk reads efficiently it doesn't need to read the whole file. Costs you less too... – stevel May 08 '23 at 19:32

Read Option 1: spark.read.csv("s3://some-location")

In this approach, you read the CSV files directly with the DataFrameReader's csv method, and Spark loads the data into a DataFrame, inferring column types only if you enable the inferSchema option (otherwise every column is read as a string). This has the advantage of simplicity and convenience, since you don't need to set up a separate table in Hive. However, it may not be the most optimized approach for large datasets, because schema inference requires an extra pass over the data, which is computationally expensive and does not always guess types correctly. Additionally, when performing transformations and actions downstream, Spark is still working against raw CSV with no declared schema or partitioning, so there is little for the query optimizer to exploit and performance can suffer.
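
To illustrate the inference cost, a sketch with inference explicitly enabled (by default Spark skips it and reads every column as a string):

// inferSchema makes an extra pass over the CSV just to guess column types
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")   // triggers the additional scan
  .csv("s3://some-location")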

Read Option 2: Create an external table in Hive and read it in Spark

In this approach, you first create an external table in Hive over the S3 location of the CSV files, then query that table from Spark to get a DataFrame. Creating an external table lets you define the schema explicitly, which avoids the schema-inference pass, saves computational overhead, and ensures the correct types are used. When you run downstream transformations and actions, Spark can rely on the declared schema and any table-level metadata (such as partitions) when building query plans. This approach also gives you more control over the schema, data partitioning, and other table properties managed through the Hive metastore.

Considering the above points, if you have control over the data and can define an appropriate schema for the CSV files, creating an external table in Hive (Read Option 2) offers better performance and more flexibility for optimization: you can define the schema explicitly, choose better storage formats, and take advantage of Hive's partitioning and bucketing.
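
As a sketch, an external table with an explicit schema and a partition column, assuming a hypothetical event_date=... directory layout under the S3 location, illustrative column names, and a Hive-enabled SparkSession (the DDL could equally be run in Hive itself):

// Hypothetical layout: s3://some-location/event_date=2023-05-01/part-*.csv
spark.sql("""
  CREATE EXTERNAL TABLE external_table_partitioned (
    id BIGINT,
    amount DOUBLE
  )
  PARTITIONED BY (event_date STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 's3://some-location'
""")

// Register the partition directories that already exist under the location
spark.sql("MSCK REPAIR TABLE external_table_partitioned")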

However, if you don't have control over the data schema or prefer a simpler approach, Read Option 1 can still work well, especially for smaller datasets or scenarios where the schema inference overhead is acceptable.

In both cases, as you apply further transformations and actions downstream, the choice of read approach matters less than the efficiency of the subsequent operations, such as filtering, aggregations, and joins. Optimizing those operations through data partitioning, caching, and proper use of Spark's APIs will have a more significant impact on performance.
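
For example, a rough sketch of that kind of downstream tuning, reusing the df read earlier; column names are made up:

import org.apache.spark.sql.functions.col

// Cut the data early and cache it if several actions reuse the same subset
val filtered = df
  .filter(col("event_date") >= "2023-01-01")
  .cache()

val daily = filtered.groupBy("event_date").count()
daily.show()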