
I found a significant difference in processing time for fillna depending on how the columns of a pandas DataFrame are selected.

Time taken for fillna on a DataFrame whose columns are selected using loc:

import time

df1 = df.copy()
t1 = time.time()
df1.loc[:, col] = df1.loc[:, col].fillna(method="ffill")
t2 = time.time()
print(t2-t1)

3.908552885055542

Time taken for fillna on a DataFrame whose columns are selected using square brackets:

df1 = df.copy()
t1 = time.time()
df1[col] = df1[col].fillna(method="ffill")
t2 = time.time()
print(t2-t1)

223.85472440719604

This post suggests that column selection using loc and square brackets is equivalent:
Selecting a list of columns: df[['A', 'B', 'C']] is the same as df.loc[:, ['A', 'B', 'C']] (selects columns A, B and C)

Can anyone please explain why there is such a time difference? Thanks!!
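
For anyone who wants to reproduce the comparison without my data, here is a minimal sketch. The frame below is synthetic (the shape, NaN ratio and column names are made up), `col` is assumed to be a list of column labels, and `.ffill()` is used in place of `fillna(method="ffill")`, which does the same forward fill. Timings on synthetic data may look nothing like the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 100k rows, 20 float columns, ~30% NaN.
data = rng.standard_normal((100_000, 20))
data[rng.random(data.shape) < 0.3] = np.nan
df = pd.DataFrame(data, columns=[f"c{i}" for i in range(20)])

# Assumption: `col` is a list of column labels (here the first ten columns).
col = list(df.columns[:10])


def fill_with_brackets(frame, cols):
    """Forward-fill the selected columns, assigning back through []."""
    out = frame.copy()
    out[cols] = out[cols].ffill()
    return out


def fill_with_loc(frame, cols):
    """Forward-fill the selected columns, assigning back through .loc."""
    out = frame.copy()
    out.loc[:, cols] = out.loc[:, cols].ffill()
    return out


# Take the best of three runs for each variant to reduce noise.
t_brackets = min(timeit.repeat(lambda: fill_with_brackets(df, col), number=1, repeat=3))
t_loc = min(timeit.repeat(lambda: fill_with_loc(df, col), number=1, repeat=3))

print(f"[]  : {t_brackets:.4f} s")
print(f"loc : {t_loc:.4f} s")
```

Both functions should produce the same filled frame; only the assignment path differs.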

piyush-balwani
  • I tested it and the results are similar; there probably is something off in your tests. Maybe if you provide sample data for this observation, so we can reproduce this massive difference – sammywemmy Aug 23 '21 at 10:31
  • Did it with `%%timeit` on my own data, and the `.loc[col]` was 680us while the `[col]` was 396us. – Aryerez Aug 23 '21 at 10:41
  • I uploaded sample dataframe [file](https://github.com/Piyushbalwani/fillna_data/blob/main/sample.pickle). `[col]` 10sec while `loc[col]` 0.3sec for this sample data – piyush-balwani Aug 23 '21 at 10:49
  • Dataframe indexing is a complicated task, involving index and column arrays. It's much more involved than `numpy` indexing, which uses positions and a compact multidimensional array. For a start you could look at `df.__getitem__` to see the code (probably python) that starts the indexing. – hpaulj Aug 23 '21 at 16:05

1 Answer


I collected a few points on this back in 2014, while I was testing on about 2M rows. They come from an SO thread I found interesting, and I have listed them below.

In general, you should use loc for label-based assignment, and iloc for integer/positional based assignment, as the spec guarantees that they always operate on the original.
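
As a small illustration of the two assignment styles (the frame, labels and values here are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

# Label-based assignment: row labelled "y", column labelled "A".
df.loc["y", "A"] = 20

# Positional assignment: third row, second column (both zero-based).
df.iloc[2, 1] = 60

print(df)
```

Both statements write into `df` itself, so the changes are visible on the original frame afterwards.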

It is also worth reading the pandas guide on how to select a subset of a DataFrame.

- loc is faster, because it does not try to create a copy of the data.

- loc is meant to modify your existing dataframe in place, which is more memory efficient.

- loc is predictable: it has one behavior.

- df.loc's syntax is explicit: with df.loc[indexer] you know automatically that df.loc is selecting rows. In contrast, it is not clear whether df[indexer] will select rows or columns (or raise ValueError) without knowing details about indexer and df.
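
The last point is easy to see on a toy frame. This is only a sketch; the frame and its labels are made up:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

by_name = df["A"]          # a column label selects a column -> Series
by_list = df[["A", "B"]]   # a list of labels selects columns -> DataFrame
by_slice = df["x":"y"]     # a slice selects ROWS (label slice, end inclusive)
by_mask = df[df["A"] > 1]  # a boolean Series selects ROWS

try:
    df["x"]                # a single row label is looked up among the columns
except KeyError as exc:
    print("KeyError:", exc)

# With .loc the first position is always rows and the second always columns.
row = df.loc["x"]
column = df.loc[:, "A"]
```

In this case the failed lookup raises KeyError; which exception you get from df[indexer] depends on the indexer and the frame.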

When using loc:

df.loc[:] returns a DataFrame

df.loc[label] returns a Series (the row) when the label is unique in the index, however many columns there are; a duplicated label returns a DataFrame

df.loc[:, ["col_name"]] returns a DataFrame, even when it selects a single column or a single row

df.loc[:, "col_name"] returns a Series

Without loc:

df["col_name"] returns a Series

df[["col_name"]] returns a DataFrame
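
The return types can be checked directly. A small sketch, using a made-up two-column frame with a default integer index and unique labels:

```python
import pandas as pd

df = pd.DataFrame({"col_name": [1.0, None, 3.0], "other": [4, 5, 6]})

# Map each selection expression to the object it returns.
selections = {
    "df.loc[:]": df.loc[:],
    "df.loc[0]": df.loc[0],
    'df.loc[:, ["col_name"]]': df.loc[:, ["col_name"]],
    'df.loc[:, "col_name"]': df.loc[:, "col_name"],
    'df["col_name"]': df["col_name"],
    'df[["col_name"]]': df[["col_name"]],
}

# Collect just the type names so they are easy to compare.
kinds = {expr: type(obj).__name__ for expr, obj in selections.items()}

for expr, kind in kinds.items():
    print(f"{expr:26} -> {kind}")
```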

Look here for some interesting details: Performance Consideration on multiple columns "Chained Assignment" with and without using .loc

Karn Kumar
  • Doesn't really answer the question – sammywemmy Aug 23 '21 at 12:11
  • sammywemmy, this is more of a performance-related question where a direct answer may not be possible, but some hints and use cases can be provided. I have seen discussions on this and many champs have their own opinions based on their use cases. I placed this in the answers section because I cannot fit these learnings into comments, hope you get it! – Karn Kumar Aug 23 '21 at 12:18