Python Polars: How to get the row count of a DataFrame?

Question

I have a CSV file that is 70 GB in size. I want to load it as a DataFrame and count the number of rows in lazy mode. What is the best way to do so?

As far as I can tell from the documentation, there is no function like shape in lazy mode. I found this answer, which provides a solution that is not based on Polars, but I wonder whether it is possible to do this in Polars as well.

What have you found when you did a websearch for "polars get row count lazy"? Was anything applicable to your case? If not, why did it not work? — Saaru Lindestøkke, Feb 21 '23 at 16:52
Why do you need a "Polars-based" solution? You're I/O-bound here in terms of performance (no computationally intensive operation), so I doubt you'll get any benefit from using Rust-based code. At the least, I'd first measure whether this is a performance bottleneck. — FObersteiner, Feb 21 '23 at 16:52
@SaaruLindestøkke A web search for "polars get row count lazy" does not yield any relevant results. — roei shlezinger, Feb 21 '23 at 18:00
@FObersteiner The answer I attached to the original post provided a solution. I ask out of curiosity. I have updated the post to clarify this. Thanks for the feedback — roei shlezinger, Feb 21 '23 at 18:01
Does this not help? https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.with_row_count.html — Saaru Lindestøkke, Feb 21 '23 at 18:16
Thanks @SaaruLindestøkke for your response, but Dean MacGregor's suggestion fits my requirements more closely. Unfortunately, with_row_count adds a column to the DF, which was not my intention, and there were performance concerns with this approach — roei shlezinger, Feb 21 '23 at 18:52
If that's the right answer for you then please hit the check mark — Dean MacGregor, Feb 21 '23 at 23:14

score 7 · Accepted Answer · answered Feb 21 '23 at 17:09

To get the row count using Polars, first load the file into a LazyFrame:

import polars as pl
lzdf = pl.scan_csv("mybigfile.csv")

Then count the rows and collect the result:

lzdf.select(pl.count()).collect()

If you want a Python scalar rather than a one-cell table as the result, index into it:

lzdf.select(pl.count()).collect()[0, 0]

I'm curious whether Polars can count the lines faster than a generic Python method, given that you're almost certainly I/O-bound.

Dean MacGregor

You can use [`.item()`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.item.html) instead of `[0, 0]` (if you're not already aware) – jqurious Feb 21 '23 at 17:18

score 0 · Answer 2 · answered Jun 12 '23 at 15:33

I have not found an efficient way to get the length of a large CSV file using LazyFrames in Polars. The approach in "How to get line count of a large file cheaply in Python?" is a quick and memory-efficient solution, although it does not use Polars LazyFrames.
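
A minimal sketch of that kind of generic approach, using only the standard library, might look like the following. It is not taken from the linked question; the function name `count_lines`, the chunk size, and the sample file are hypothetical. It assumes the CSV has a single header line and no quoted fields containing embedded newlines, since otherwise the line count and the row count differ.

```python
import os
import tempfile


def count_lines(path, chunk_size=1024 * 1024):
    """Count lines by reading the file in fixed-size binary chunks."""
    newlines = 0
    last_byte = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        newlines += 1
    return newlines


# Hypothetical sample file: one header line plus three data rows,
# with no newline after the last row.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as f:
    f.write(b"a,b\n1,4\n8,5\n3,2")

total_lines = count_lines(path)
data_rows = total_lines - 1  # assumption: exactly one header line
print(total_lines, data_rows)
```

Only one chunk is held in memory at a time, so memory use stays flat regardless of file size.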

score 0 · Answer 3 · edited Aug 12 '23 at 08:43

The Polars DataFrame class doesn't have a collect() method. For normal (non-lazy) DataFrames, you can use pl.count() and get the result as shown below.

import polars as pl
df = pl.DataFrame({"a": [1, 8, 3], "b": [4, 5, 2], "c": ["foo", "bar", "foo"]})

row_count = df.select(pl.count())[0, 0]

or

row_count = df.select(pl.count()).item()

answered Jul 12 '23 at 10:47

Sairam Krish

The currently accepted answer uses `lzdf` to hint that a lazy DataFrame is used. `collect()` is necessary for lazy DataFrames, but not for 'normal' ones – Alleo Aug 12 '23 at 05:33
