
I have this code snippet:

private Table getParquetAsTable(BlobClient blob)
{
     var stream = blob.OpenRead();
     var parquetReader = new ParquetReader(stream);

     return parquetReader.ReadAsTable();
}

What this code does is read a Parquet file from Azure Blob Storage. If my file has <= 10 columns it is returned quickly, but for bigger files I have to wait more than 40 seconds. While debugging, I noticed that the slow part is the `return parquetReader.ReadAsTable()` call. I use the Parquet.Net (parquet-dotnet) library to read the Parquet file. Is there a way to speed this up? Can I limit the stream, to the first 100 bytes for example, and have it returned faster? If so, how can I do this?

anthino12
  • you should probably put that in a `using` block, to close the stream after use. – JHBonarius Oct 04 '21 at 08:04
  • How big is your 'big' file? – Neil Oct 04 '21 at 11:26
  • The file I tested is 60mb but bear in mind that bigger files should be read as well @Neil – anthino12 Oct 04 '21 at 11:31
  • How fast do you need it to be? 40s reading and parsing a 60mb file doesn't seem unreasonable to me. – Neil Oct 04 '21 at 11:34
  • I agree, there's no doubt about that. I'm trying to speed it up by reading the first 100 rows from this 60mb file, if possible. – anthino12 Oct 04 '21 at 11:37
  • As @JHBonarius mentions in my answer, that's not going to speed it up due to the way the files are laid out. Depending on your network connection, it's possible that downloading the whole file, and then accessing it would be quicker (still not ideal for anything over 100MB). – Neil Oct 04 '21 at 11:40
  • So reading first n rows from a Parquet, stored in a blob, is impossible without prior reading the whole file? :/ I'm sorry if this question sounds dumb to you, I've never worked with parquets before in my life – anthino12 Oct 04 '21 at 11:43
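As Neil suggests in the comment above, one option is to download the whole blob into memory once and open the reader over a `MemoryStream`, instead of letting the reader seek back and forth over the blob stream returned by `OpenRead()`, which can trigger many small range requests. A minimal sketch, reusing the names from the question (not a verified drop-in, and it holds the entire file in memory):

private Table getParquetAsTable(BlobClient blob)
{
    using (var ms = new MemoryStream())
    {
        // one sequential download instead of many small seeks/reads over the network
        blob.DownloadTo(ms);

        // rewind before handing the stream to the reader
        ms.Position = 0;

        using (var parquetReader = new ParquetReader(ms))
        {
            return parquetReader.ReadAsTable();
        }
    }
}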

1 Answer


I would suggest reading the "Reading Files" section of the official website, which shows how to read the file one row group at a time. Overall this will still take the same amount of time (or even longer), but it means you can process row groups individually, rather than loading everything into a single table at once.

using (Stream fileStream = System.IO.File.OpenRead("c:\\test.parquet"))
{
   // open parquet file reader
   using (var parquetReader = new ParquetReader(fileStream))
   {
      // get file schema (available straight after opening parquet reader)
      // however, get only data fields as only they contain data values
      DataField[] dataFields = parquetReader.Schema.GetDataFields();

      // enumerate through row groups in this file
      for(int i = 0; i < parquetReader.RowGroupCount; i++)
      {
         // create row group reader
         using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i))
         {
            // read all columns inside each row group (you have the option to read only
            // the required columns if you need to)
            DataColumn[] columns = dataFields.Select(groupReader.ReadColumn).ToArray();

            // get first column, for instance
            DataColumn firstColumn = columns[0];

            // .Data member contains a typed array of column data you can cast to the type of the column
            Array data = firstColumn.Data;
            int[] ids = (int[])data;
         }
      }
   }
}
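
If you only need a few of the columns, you can skip the rest instead of calling `ReadColumn` for every field, as the comment in the code above hints. A minimal sketch, reusing `parquetReader` and `dataFields` from the example and assuming a hypothetical integer column named `id`:

// pick just the field you care about ("id" is a made-up column name here)
DataField idField = dataFields.First(f => f.Name == "id");

using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(0))
{
   // only this column chunk is read and decoded; the other columns are skipped
   DataColumn idColumn = groupReader.ReadColumn(idField);
   int[] ids = (int[])idColumn.Data;
}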
Neil
  • Not sure if this will fix it. See [this](https://github.com/elastacloud/parquet-dotnet/issues/307#issuecomment-395505123): _"...The issue is that parquet format is not really designed for streaming at all....The problem here is that parquet metadata is located at the very end of the file, therefore one needs to rewind stream pointer there, or essentially download the whole file first..."_ – JHBonarius Oct 04 '21 at 08:10
  • @JHBonarius so I can't really speed it up? – anthino12 Oct 04 '21 at 08:43
  • @anthino12 you could at least try the code suggested in this answer. Also read the link. – JHBonarius Oct 04 '21 at 09:13
  • I tried it, it's still kind of slow. And about the link, I've seen it previously. However, thank you so much :) – anthino12 Oct 04 '21 at 09:16
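
For context on JHBonarius's point about the metadata living at the end of the file: a Parquet file ends with the Thrift-encoded footer (schema plus row group offsets), a 4-byte little-endian footer length, and the 4-byte `PAR1` magic bytes. A rough sketch of how a reader locates the footer, just to illustrate why the stream has to be seekable and why reading only the first 100 bytes cannot work (the file path is a placeholder):

using (Stream fileStream = System.IO.File.OpenRead("c:\\test.parquet"))
{
   // the last 8 bytes are: [4-byte footer length][4 bytes "PAR1"]
   var tail = new byte[8];
   fileStream.Seek(-8, SeekOrigin.End);
   fileStream.Read(tail, 0, 8);

   int footerLength = BitConverter.ToInt32(tail, 0);                 // little-endian length
   string magic = System.Text.Encoding.ASCII.GetString(tail, 4, 4);  // should be "PAR1"

   // the footer itself starts footerLength bytes before the length field
   long footerStart = fileStream.Length - 8 - footerLength;
   Console.WriteLine($"footer: {footerLength} bytes at offset {footerStart}, magic = {magic}");
}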