0

Context

I'm working on a DataFrame df with lots of columns filled with numerical values

df
lorem ipsum  |  dolor sic  |  ...  |  (hundreds of cols)
---------------------------------------------------------
0.5          |     -6.2    |  ...  | 79.8
-26.1        |     6200.0  |  ...  | -65.2
150.0        |     3.14    |  ...  | 1.008

By another mean, I have a list_cols of columns:

list_cols = ['lorem ipsum', 'dolor sic', ... ]  # arbitrary length, of course len(list_cols ) <= len(df.columns), and contains valid columns of my df

I want to obtain 2 dataframes :

  • 1 that contains all rows where value < 0 for at least one of list_cols (corresponds to a OR ). let's call it negative_values_matches
  • 1 that corresponds to the remaining of dataframe, lets call it positive_values_matches

Expected result example

for list_cols = ['lorem ipsum', 'dolor sic'], I shall obtain dataframes were at least 1 value in list_cols is strictly negative:

negative_values_matches
lorem ipsum  |  dolor sic  |  ...  |  (hundreds of cols)
---------------------------------------------------------
0.5          |     -6.2    |  ...  | 79.8
-26.1        |     6200.0  |  ...  | -65.2


positive_values_matches
lorem ipsum  |  dolor sic  |  ...  |  (hundreds of cols)
---------------------------------------------------------
150.0        |     3.14    |  ...  | 1.008

I don't want to write myslef this kind of code:

negative_values_matches = df[ (criterion1 | criterion2 | ... | criterionn)]
positive_values_matches = df[~(criterion1 | criterion2 | ... | criterionn)]

(where criterionk is a boolean evaluation for column k such as for instance: (df[col_k]>=0), parenthesis intended here since its the Pandas syntax)

The idea is to have a programmatic approach. I'm mainly looking for an array of booleans, so I can then use Boolean indexing (see Pandas documentation).

As far as I can tell, these posts are not exactly what I am talking about:

I can't figure out how to chain the booleans evaluations on my DataFrame altogether with ORoperator anbd obtain the correct rows splitting.

What can I do ?

LoneWanderer
  • 3,058
  • 1
  • 23
  • 41

1 Answers1

1

After several attempts, I managed to achieve my goal.

Here is the code:

import Pandas
import numpy
# assume dataframe exists
df = ...
# initiliaze an array of False, matching df number of rows
resulting_bools = numpy.zeros((1, len(df.index)), dtype=bool)

for col in list_cols:
    # obtain array of booleans for given column and boolean condition for [row, column] value
    criterion = df[col].map(lambda x: x < 0) # same condition for each column, different conditions would have been more difficult (for me)

     # perform cumulative boolean evaluation accross columns
    resulting_bools |= criterion

# use the array of booleans to build the required df
negative_values_matches = df[ resulting_bools].copy() # use .copy() to avoid further possible warnings from Pandas depending on what you do with your data frame
positive_values_matches = df[~resulting_bools].copy()

This way, I successfully obtained 2 dataframes:

  • 1 with all rows that have a value < 0 for at least 1 of the column in list_cols
  • 1 with all other rows (value >= 0 for each col in list_col)

(The array initialization on False depends on the boolean evaluations choices)


NB: The approach could be combined with multiple conditions on dataframes. To be confirmed.

LoneWanderer
  • 3,058
  • 1
  • 23
  • 41