13

I just can't figure out what "==" means at the second line:
- It is not a test, there is no if statement...
- It is not a variable declaration...

I've never seen this before. The thing is, data.categ == cat is a pandas Series and not a test...

for cat in data["categ"].unique():
    subset = data[data.categ == cat] # Create the subsample
    print("-"*20)
    print('Category: ' + cat)
    print("mean:\n",subset['montant'].mean())
    print("median:\n",subset['montant'].median())
    print("mode:\n",subset['montant'].mode())
    print("variance:\n",subset['montant'].var())
    print("standard deviation:\n",subset['montant'].std())
    plt.figure(figsize=(5,5))
    subset["montant"].hist(bins=30) # Build the histogram
    plt.show() # Display the histogram
Georgy
Xomuama
  • 1
    Maybe they're doing an element-wise comparison of two numpy arrays, and using the resulting boolean array as a selector to data? https://stackoverflow.com/questions/10580676/comparing-two-numpy-arrays-for-equality-element-wise Pandas is weird. – Neil Apr 20 '20 at 17:04
  • 1
    It is a mask like `positive_X = X[X > 0]` from numpy. – Guimoute Apr 20 '20 at 17:05

5 Answers

13

It tests each element of data.categ for equality with cat. That produces a vector of True/False values. This vector is passed as an indexer to data[], which returns the rows of data that correspond to the True values in the vector.

To summarize, the whole expression returns the subset of rows from data where the value of data.categ equals cat.
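
The two steps described above can be separated to make them visible. This is only a minimal sketch: the column names categ and montant come from the question, while the values and the intermediate name mask are made up for illustration.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
data = pd.DataFrame({
    "categ": ["A", "B", "A", "C"],
    "montant": [10.0, 20.0, 30.0, 40.0],
})
cat = "A"

# Step 1: element-wise comparison. The result is a boolean Series,
# one True/False value per row, not a single True or False.
mask = data.categ == cat
print(mask.tolist())  # [True, False, True, False]

# Step 2: use the boolean Series as an indexer.
# Only the rows where mask is True are kept.
subset = data[mask]
print(subset)
```

Writing data[data.categ == cat] on one line, as in the question, does the same thing without naming the intermediate mask.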

(It seems possible that the whole operation could be done more elegantly using data.groupby('categ').apply(someFunc).)
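
As a rough illustration of the grouping idea, here is one possible version. It is a sketch under assumptions: the data is invented, and it uses agg with named statistics instead of apply with a custom function, because the statistics printed in the question all have built-in names.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
data = pd.DataFrame({
    "categ": ["A", "B", "A", "B"],
    "montant": [10.0, 20.0, 30.0, 50.0],
})

# One row of statistics per category, with no explicit loop or mask.
stats = data.groupby("categ")["montant"].agg(["mean", "median", "var", "std"])
print(stats)

# If a loop is still needed (for example to draw one histogram per
# category), iterating over the groups yields the same subsets as
# data[data.categ == cat].
for cat, subset in data.groupby("categ"):
    print(cat, len(subset))
```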

Dave Costa
2

It creates a boolean Series that is True at the index positions where data.categ is equal to cat. With this boolean mask you can filter your dataframe; in other words, subset will contain all records where categ is the value stored in cat.

This is an example using numeric data:

import numpy as np
import pandas as pd

np.random.seed(0)
a = np.random.choice(np.arange(2), 5)
b = np.random.choice(np.arange(2), 5)
df = pd.DataFrame(dict(a = a, b = b))


df[df.a == 0].head()

#   a   b
# 0 0   0
# 2 0   0
# 4 0   1

df[df.a == df.b].head()

#   a   b
# 0 0   0
# 2 0   0
# 3 1   1
jcaliz
2

Yes, it is a test. Boolean expressions are not restricted to if statements.

It looks as if data is a pandas data frame. An expression used as a data frame index is how pandas denotes a selector or filter. This one says to select every row in which the field categ matches the variable cat (here, the loop variable). This collection of rows becomes a new data frame, subset.

Prune
2

data.categ == cat returns a boolean Series that is then used to filter your dataframe, keeping only the rows where the boolean value is True.

Booleans are used in many situations, not only in if statements.
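
To illustrate that a comparison is an ordinary expression whose value can be stored and reused, here is a small sketch in plain Python, without pandas. All names in it are made up for the example.

```python
# A comparison produces a value, just like 1 + 2 does.
result = (3 == 3)
print(result)  # True

# That value can be stored in a list and used later as a filter,
# which is the same idea pandas applies to a whole column at once.
values = [1, 2, 3]
flags = [x == 2 for x in values]
selected = [x for x, keep in zip(values, flags) if keep]
print(flags)     # [False, True, False]
print(selected)  # [2]
```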

Henrique Branco
2

Here you are comparing data.categ with cat, the element currently being iterated over among the unique values of data["categ"].
The rows where they are equal are kept in subset.

Harshit Ruwali