
I have a CSV file that is 70 GB in size. I want to load it as a DataFrame and count the number of rows, in lazy mode. What's the best way to do so?

As far as I can tell from the documentation, there is no function like shape in lazy mode. I found this answer, which provides a solution not based on Polars, but I wonder whether it is possible to do this in Polars as well.

  • What have you found when you did a websearch for "polars get row count lazy"? Was anything applicable to your case? If not, why did it not work? – Saaru Lindestøkke Feb 21 '23 at 16:52
  • Why do you need a "polars-based" solution? You're IO-bound here in terms of performance (no computationally intensive operation), so I doubt you'll get any benefit from using Rust-based code... at least I'd try to measure first whether this is a performance bottleneck. – FObersteiner Feb 21 '23 at 16:52
  • @SaaruLindestøkke A web search for "polars get row count lazy" does not yield any relevant results. – roei shlezinger Feb 21 '23 at 18:00
  • @FObersteiner The answer I attached to the original post provided a solution. I ask out of curiosity. I have updated the post to clarify this. Thanks for the feedback – roei shlezinger Feb 21 '23 at 18:01
  • Does this not help? https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.with_row_count.html – Saaru Lindestøkke Feb 21 '23 at 18:16
  • Thanks @SaaruLindestøkke for your response, but Dean MacGregor's suggestion fits my requirements more closely. Unfortunately, with_row_count adds a column to the DF, which was not my intention, and there were performance concerns with this approach – roei shlezinger Feb 21 '23 at 18:52
  • If that's the right answer for you then please hit the check mark – Dean MacGregor Feb 21 '23 at 23:14

3 Answers


To get the row count using Polars, first scan the file into a LazyFrame:

import polars as pl
lzdf = pl.scan_csv("mybigfile.csv")

Then count the rows and collect the result (recent Polars releases deprecate pl.count() for this purpose in favor of pl.len()):

lzdf.select(pl.count()).collect()

If you want a Python scalar rather than a one-cell table as the result, index into it:

lzdf.select(pl.count()).collect()[0,0]

I'm curious whether Polars can count the lines faster than a generic Python method, given that you're almost certainly just IO-bound.

Dean MacGregor
  • You can use [`.item()`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.item.html) instead of `[0, 0]` (if you're not already aware) – jqurious Feb 21 '23 at 17:18

I have not found an efficient way to get the length of large CSV files using LazyFrames in Polars. The approach in "How to get line count of a large file cheaply in Python?" is actually the quick and memory-efficient solution, although it does not use Polars LazyFrames.
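The generic approach referred to above can be sketched in plain Python as follows. This is a minimal sketch rather than the linked answer's exact code: `count_lines` and `chunk_size` are hypothetical names, and it assumes that no CSV field contains an embedded newline, so that physical lines correspond to records.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count lines by reading the file in fixed-size binary chunks.

    Only one chunk is held in memory at a time, so memory use does not
    grow with file size. Assumes no quoted field contains a newline.
    """
    lines = 0
    last_byte = b"\n"  # treat an empty file as having zero lines
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            lines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte != b"\n":
        lines += 1
    return lines
```

If the file has a header row, the number of data rows would be `count_lines("mybigfile.csv") - 1`.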

Thomas K

The Polars DataFrame class doesn't have a collect() method. For normal (non-lazy) DataFrames, we can use pl.count() and read the result as shown below.


import polars as pl
df = pl.DataFrame({"a": [1, 8, 3], "b": [4, 5, 2], "c": ["foo", "bar", "foo"]})

row_count = df.select(pl.count())[0,0]

or

row_count = df.select(pl.count()).item()
Sairam Krish
  • The currently accepted answer uses `lzdf` to hint that a lazy DataFrame is used. `collect()` is necessary for lazy DataFrames, but not for 'normal' ones – Alleo Aug 12 '23 at 05:33