
Consider the data set:

small TED dataset

(a very small dataset derived from this Kaggle dataset, which is available under the CC BY-NC-SA 4.0 license)

In the code below, I build the filters boolean list from the values of a different column each time, then apply that list to the DataFrame as a boolean index. I get the correct set of results each time, even though I never mention the column name while indexing! How is pandas applying the boolean index to the correct column each time? As far as I know, the filters list carries no meta information about which column of the DataFrame was used to construct it, so I am totally perplexed as to how this works.

import pandas as pd
df = pd.read_csv("ted_small.csv")

# 1: Let's try filtering by "comments" > 500 first
filter_by = df["comments"]
filters = []
for i in filter_by:
    if i > 500:
        filters.append(True)
    else:
        filters.append(False)

print(f"Filters list is: {filters}")
df[filters]

It correctly outputs only those rows with comments > 500.

Then I change my list to be constructed based on the values of "duration".

import pandas as pd
df = pd.read_csv("ted_small.csv")

# 2: Let's try filtering by "duration" > 1000 now
filter_by = df["duration"]
filters = []
for i in filter_by:
    if i > 1000:
        filters.append(True)
    else:
        filters.append(False)

df[filters]

It correctly outputs only those rows with duration > 1000!

How is this magic happening?

If I were to do something like this:

df[df['comments'] > 500]

I do understand why I would get the correct result: there is some meta information about which column the filter was derived from, as seen in the output of:

df['comments'] > 500
0     True
1    False
2    False
3    False
4     True
5     True
6     True
7    False
8     True
9     True
Name: comments, dtype: bool

(Note the reference to "comments" in the output above)

EDIT: Thanks to the discussion in the comments section, I understand it now! Once the filters boolean list is created, the column used to create it no longer matters. pandas simply returns the rows whose position in the list holds True.
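To make the conclusion in the EDIT concrete, here is a minimal sketch. It assumes a made-up four-row DataFrame in place of ted_small.csv (the values are hypothetical, not taken from the TED data), and it only illustrates the observed behaviour rather than pandas internals: a boolean list is matched to rows by position, and the name on a boolean Series plays no part in the selection.

```python
import pandas as pd

# Hypothetical stand-in for ted_small.csv; these values are made up.
df = pd.DataFrame({
    "comments": [900, 100, 600, 50],
    "duration": [500, 1500, 1200, 300],
})

# Plain Python lists of booleans, one per row, built from two different columns.
comment_flags = [bool(c > 500) for c in df["comments"]]    # [True, False, True, False]
duration_flags = [bool(d > 1000) for d in df["duration"]]  # [False, True, True, False]

# Row k is kept exactly when the list holds True at position k.
by_list = df[comment_flags]

# The same selection written purely by position, with no column involved.
positions = [k for k, keep in enumerate(comment_flags) if keep]
by_position = df.iloc[positions]

# A different list simply marks different rows.
by_duration = df[duration_flags]

# For a boolean Series, the name is only a label; renaming it changes nothing.
mask = df["comments"] > 500
renamed_mask = mask.rename("anything")
same_rows = df[mask].equals(df[renamed_mask])
```

With this data, `by_list` and `by_position` both hold rows 0 and 2, `by_duration` holds rows 1 and 2, and `same_rows` is True. The column only matters while the list is being built.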

2020
  • The column selection is implicit in your for loop, i.e. `for i in filter_by: ...`, where `filter_by` is column dependent. In other words, the Boolean values are a function of your column name. – jpp Dec 31 '19 at 17:10
  • Side note: you can replace your for loop with just `df['comments'] > 500` or `df['comments'].gt(500)`. – Erfan Dec 31 '19 at 17:15
  • My question is how pandas applies the `filters` list to the correct column. After all, `filters` is simply a boolean list. It could have been used on the `comments` column, the `duration` column, the `languages` column or the `views` column, couldn't it? – 2020 Dec 31 '19 at 17:15
  • @2020 the magic happens on rows, not columns: all the rows whose value is `True` are returned. Notice that the length of the list and the number of rows match. – anky Dec 31 '19 at 17:19
  • It simply keeps track of the information. There is no extra meta-information on what the column is. – Willem Van Onsem Dec 31 '19 at 17:21
  • @Erfan: I can understand why I would get the correct result if I did what you suggest. I have edited my question to clarify that. But how come I get the correct result when my `filters` list has no reference to the column its values were derived from? That's my question. – 2020 Dec 31 '19 at 17:22
  • @anky_91: I don't think you understand my question. The filtering is applied on the columns to choose the rows that have a True value in a specific column. My question is: how is that "specific" column chosen, given a boolean index? – 2020 Dec 31 '19 at 17:24
  • There's no need for a reference. You can see each `True` as a marker of the condition in your for loop: it marks each row that satisfies your condition (`df['comments'] > 500`) as `True`. Then you pass the whole list of booleans to your dataframe, and only the rows where your list is `True` get returned. That's how boolean indexing works in a nutshell. Try passing `df[filters[:-1]]` to your dataframe. You will get an error. – Erfan Dec 31 '19 at 17:24
  • @2020 I definitely didn't :) Trying hard now. – anky Dec 31 '19 at 17:25
  • @Erfan: I don't get an error when I try `df[filters[-1::-1]]`. It just gives some arbitrary results that match neither comments > 500 nor duration > 1000! – 2020 Dec 31 '19 at 17:37
  • Thanks Erfan and anky_91: I understand it now! Once the filters boolean list is created, the exact column used to create it doesn't matter. Simply, the rows that have True will be returned (and those will of course match the condition used to create the boolean list). – 2020 Dec 31 '19 at 17:42
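
The two experiments mentioned in the comments above (dropping the last element of the list, and reversing it) can be reproduced with a small sketch. It assumes a made-up four-row DataFrame rather than ted_small.csv, so the values are hypothetical. It also checks that the for loop, `> 500` and `.gt(500)` produce the same flags.

```python
import pandas as pd

# Hypothetical stand-in for ted_small.csv; these values are made up.
df = pd.DataFrame({
    "comments": [900, 100, 600, 50],
    "duration": [500, 1500, 1200, 300],
})

# The loop from the question, written as a list comprehension.
filters = [bool(c > 500) for c in df["comments"]]  # [True, False, True, False]

# The vectorized spellings give the same flags as the loop.
same_flags = (
    filters
    == (df["comments"] > 500).tolist()
    == df["comments"].gt(500).tolist()
)

# A list that is one element short no longer lines up with the rows,
# so pandas refuses it (a ValueError on the pandas versions I am aware of).
try:
    df[filters[:-1]]
    length_error = None
except ValueError as exc:
    length_error = exc

# A reversed list has the right length, so it is accepted, but the True
# values now sit at different positions and mark different rows.
reversed_filters = filters[::-1]      # [False, True, False, True]
reversed_rows = df[reversed_filters]  # rows 1 and 3, whose comments are 100 and 50
```

The reversed list raises no error because only its length is checked; pandas has no way to know which condition the flags were meant to encode, which is why the result looks arbitrary.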
