-2

I am new to C#. We have a requirement to generate Parquet files from CSV files. Our files are up to 30 GB in size, so performance matters when generating them.

I have not found any help or suggestions on Google for handling this.

Can someone please suggest or share a solution (either a console application or a Script Task)?

Aron
Srini
  • Does this answer your question? [How to convert a csv file to parquet](https://stackoverflow.com/questions/26124417/how-to-convert-a-csv-file-to-parquet) – A-Tech Nov 15 '22 at 09:56
  • The requirement is to create a C# console program, thanks. – Srini Nov 15 '22 at 10:00
  • If you're programming in C# one assumes you know what NuGet packages are. Have you looked for a NuGet package that can read and write Parquet files? – AlwaysLearning Nov 15 '22 at 10:14
  • Why is this tagged SQL Server? Anyway, I typed _C# Parquet Library_ into Google and this was the top link: https://www.nuget.org/packages/Parquet.Net It's inconceivable to me that you could not find this. – Nick.Mc Nov 15 '22 at 10:18
  • Does this help? https://stackoverflow.com/questions/60929842/how-to-convert-a-csv-file-to-parquet-using-c-sharp/62181950#62181950 – Cinchoo Jan 28 '23 at 00:27

3 Answers

1

You can use this NuGet package, which includes an automatic serializer/deserializer between C# classes and Parquet files. It works by generating MSIL (bytecode) on the fly and is therefore very fast.

Thomas 94
0

I haven't tried this yet, but you could do the conversion with CLI tools and just call those from C# (that is, "shell out").
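
A minimal sketch of the "shell out" idea, under assumptions that are not part of this answer: it assumes the `duckdb` command-line client is installed and on the PATH, and the file names `input.csv` and `output.parquet` are placeholders. Any other CLI converter could be substituted in the same way.

```csharp
using System;
using System.Diagnostics;

// Assumption: the DuckDB CLI is the converter; the answer does not name a tool.
// Placeholder file names; read_csv_auto lets DuckDB infer the CSV layout.
const string sql =
    "COPY (SELECT * FROM read_csv_auto('input.csv')) " +
    "TO 'output.parquet' (FORMAT 'parquet')";

var startInfo = new ProcessStartInfo("duckdb")
{
    UseShellExecute = false,       // required for stream redirection
    RedirectStandardError = true   // capture error text from the tool
};
// ArgumentList handles quoting, so the SQL is passed as a single argument.
startInfo.ArgumentList.Add("-c");
startInfo.ArgumentList.Add(sql);

using var process = Process.Start(startInfo)
    ?? throw new InvalidOperationException("Could not start the duckdb process.");

string errors = process.StandardError.ReadToEnd();
process.WaitForExit();

if (process.ExitCode != 0)
{
    throw new InvalidOperationException($"Conversion failed: {errors}");
}
```

The conversion itself runs entirely in the external process, so the C# program only has to wait for it and check the exit code.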

yzorg
0

There are at least 3 different solutions to this problem.

You can read the CSV files into an IEnumerable<Dto> and write the Parquet file using either Parquet.Net or ParquetSharp.
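
As a rough sketch of the Parquet.Net variant (not tested against a 30 GB file): the `FlightDto` class, the `|` delimiter, the column order and the batch size are all assumptions made for illustration, and the naive `Split` does not handle quoted fields. It assumes .NET 6+ and the class serializer from Parquet.Net 4.x (`ParquetSerializer.SerializeAsync` with `ParquetSerializerOptions.Append`), writing one batch at a time so the whole CSV never has to sit in memory.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Parquet.Serialization;

// Hypothetical DTO; property names become the Parquet column names.
public class FlightDto
{
    public DateTime FlightDate { get; set; }
    public string UniqueCarrier { get; set; } = "";
    public string OriginCityName { get; set; } = "";
    public string DestCityName { get; set; } = "";
}

public static class CsvToParquet
{
    // Streams the CSV lazily; assumes a header row and no quoted fields.
    private static IEnumerable<FlightDto> ReadCsv(string path) =>
        File.ReadLines(path).Skip(1).Select(line =>
        {
            string[] f = line.Split('|');
            return new FlightDto
            {
                FlightDate = DateTime.Parse(f[0], CultureInfo.InvariantCulture),
                UniqueCarrier = f[1],
                OriginCityName = f[2],
                DestCityName = f[3]
            };
        });

    public static async Task ConvertAsync(
        string csvPath, string parquetPath, int batchSize = 100_000)
    {
        // Appending needs a readable, seekable stream.
        await using var output = new FileStream(
            parquetPath, FileMode.Create, FileAccess.ReadWrite);

        bool first = true;
        foreach (FlightDto[] batch in ReadCsv(csvPath).Chunk(batchSize))
        {
            // The first batch creates the file; later batches are appended.
            var options = new ParquetSerializerOptions { Append = !first };
            await ParquetSerializer.SerializeAsync(batch, output, options);
            first = false;
        }
    }
}
```

A call such as `await CsvToParquet.ConvertAsync("flights.csv", "test.parquet");` would drive it. For real-world CSV input a dedicated CSV parser is likely a better choice than `Split`.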

The third solution is to use DuckDB.Net to craft a SQL statement to read the CSV directly into a Parquet file.

COPY (
    SELECT * 
    FROM read_csv('flights.csv', delim='|', header=True, columns={'FlightDate': 'DATE', 'UniqueCarrier': 'VARCHAR', 'OriginCityName': 'VARCHAR', 'DestCityName': 'VARCHAR'})
) TO 'test.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000)

You can execute this statement using the DuckDB.Net ADO.NET connector.
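
For illustration, a minimal sketch of running the COPY statement above from a C# console program. It assumes the `DuckDB.NET.Data.Full` NuGet package (which bundles the native DuckDB library) and an in-memory database. The file names and column definitions are taken from the SQL above.

```csharp
using DuckDB.NET.Data;

// In-memory database: DuckDB is used purely as a conversion engine here.
using var connection = new DuckDBConnection("DataSource=:memory:");
connection.Open();

using var command = connection.CreateCommand();
command.CommandText = @"
    COPY (
        SELECT *
        FROM read_csv('flights.csv', delim='|', header=True,
            columns={'FlightDate': 'DATE', 'UniqueCarrier': 'VARCHAR',
                     'OriginCityName': 'VARCHAR', 'DestCityName': 'VARCHAR'})
    ) TO 'test.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000)";

// COPY returns no result set, so ExecuteNonQuery is sufficient.
command.ExecuteNonQuery();
```

With this approach DuckDB does the reading and writing itself, so no rows are materialized as C# objects.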

Disclaimer: I am a contributor to the DuckDB.Net project.

Aron