1

I'm having trouble describing exactly what I want to achieve. I've tried looking here on stack to find others with the same problem, but are unable to find any. So I will try to describe exactly what I want and give you a sample setup code.

I would like to have a function that gives me a new column/pd.Series. This new column has boolean TRUE values (or int's) that are based on a certain condition.

The condition being as follows. There are N number of columns (example is 8), each with the same name but ending with one new number. IE, column_1, column_2 etc. The function I need is twofold:

  1. If N is given, look for/through each column row and see if it and the next N columns row are also TRUE/1 ..
  2. If N is NOT given, look for each column row and if all next columns rows are also TRUE/1, with the numbers as ID's to look at the column.
def get_df_series(df: pd.DataFrame, columns_ids: list, n: int=8) -> pd.Dataframe:

    for i in columns_ids:
        # missing code here .. i dont know if this would be the way to go
        pass

    return df


def create_dataframe(numbers: list) -> pd.DataFrame:
    df = pd.DataFrame()  # empty df

    # create a column for each number with the number as ID and with random boolean values as int's
    for i in numbers:
        df[f'column_{i}'] = np.random.randint(2, size=20)

    return df


if __name__=="__main__":

    numbers = [1, 2, 3, 4, 5, 6, 7, 8]

    df = create_dataframe(numbers=numbers)

    df = get_df_series(df=df, numbers=numbers, n=3)

I have some experience with Pandas dataframes and know how to create IF/ELSE things with np.select for example.

(function) select(condlist: Sequence[ArrayLike], choicelist: Sequence[ArrayLike], default: ArrayLike = ...) -> NDArray


The problem I'm running into is that I don't know how to make a conditional statement if I don't know how many columns are ahead. For example, if I want to know for column_5 if the next 3 are also true, I can hardcode this, but I have columns up to id 20 and would love to not have to hardcode everything from column_1 to column_20 if I want to know if all conditions in all those columns are true.

Now the problem is that I don't know if this is even possible. So any feedback would be appreaciated. Even just giving me a hint on where to look for a way to do this.

EDIT: What I forgot to mention was that there will be random columns in between that obviously cannot be taking into the equation. For example, there will be main_column_1, main_column_2, main_column_3, side_column_1, side_column_2, right_column_1, main_column_3, main_column_4 etc...

The answer Corralien gave is correct, but I should've made my question more clearer.

I need to be able to, say, look at main_column and for that one look ahead N amount of columns of the same type: main_column.

  • For a row=[0, 1, 1, 1, 1, 0, 1, 1, 1] and N=3, the expected result is [0, 1, 0, 0, 0, 0, 0, 0, 0]? (I have a doubt for last columns when the number of remaining columns is less than N) – Corralien Jan 28 '23 at 16:28
  • Why do you return a Series and not a DataFrame? – Corralien Jan 28 '23 at 16:43
  • My bad, I tried something myself and forgot I had that standing there. The function get_df_series() is currently "not good". I have no idea yet what to do there. Return a dataframe I guesss. I can always delete columns I dont need anymore at the end... For a row=[0, 1, 1, 1, 1, 0, 1, 1, 1], each value being a unique column, the result would indeed be [0, 1, 0, 0, 0, 0, 0, 0, 0].. I think it would be a good idea to have each column (column_1, column_2 etc) to create a result_1, result_2 column etc? – Wicked Gummy Bear Jan 28 '23 at 16:47
  • So if N=5 for example, the last 6 columns will always been equal to 0 (or False)? – Corralien Jan 28 '23 at 17:04
  • N is only the number of columns I wish to look ahead. So if I would start at column_3, and I want to look 2 ahead, I want to know if the same row in column_4 and column_5 are also True – Wicked Gummy Bear Jan 28 '23 at 17:06

1 Answers1

1

Try:

n = 3
out = (df.rolling(n, min_periods=1, axis=1).sum()
         .shift(-n+1, fill_value=0, axis=1).eq(n).astype(int)
         .rename(columns=lambda x: 'result_' + x.split('_')[1]))

Output:

>>> out
    result_1  result_2  result_3  result_4  result_5  result_6  result_7  result_8
0          1         1         1         1         1         1         0         0
1          0         0         0         0         0         0         0         0
2          0         0         0         0         0         0         0         0
3          0         0         0         0         0         0         0         0
4          0         0         0         1         0         0         0         0
5          0         0         0         0         0         0         0         0
6          0         0         0         0         0         0         0         0
7          0         0         0         0         0         0         0         0
8          0         1         1         1         0         0         0         0
9          0         0         0         0         0         1         0         0
10         0         0         0         0         0         0         0         0
11         0         0         0         0         1         0         0         0
12         0         0         0         0         0         0         0         0
13         0         0         0         1         1         0         0         0
14         0         0         0         0         0         1         0         0
15         0         0         0         0         0         0         0         0
16         0         0         0         0         0         0         0         0
17         0         0         1         0         0         0         0         0
18         0         0         1         0         0         0         0         0
19         0         0         0         0         0         0         0         0

Input:

>>> df
    column_1  column_2  column_3  column_4  column_5  column_6  column_7  column_8
0          1         1         1         1         1         1         1         1
1          0         1         0         0         0         1         1         0
2          1         1         0         1         0         1         1         0
3          1         0         1         0         0         0         0         0
4          1         0         0         1         1         1         0         1
5          1         1         0         1         0         1         1         0
6          1         0         1         0         0         0         0         1
7          0         0         1         0         0         0         0         0
8          0         1         1         1         1         1         0         0
9          1         0         1         1         0         1         1         1
10         0         0         1         1         0         0         1         1
11         1         0         1         0         1         1         1         0
12         0         1         1         0         1         0         1         0
13         0         0         0         1         1         1         1         0
14         0         0         1         1         0         1         1         1
15         1         0         0         1         0         1         0         0
16         1         0         0         0         0         0         0         1
17         0         0         1         1         1         0         0         1
18         0         0         1         1         1         0         0         1
19         0         0         1         0         0         0         1         0
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • I'm testing it right now.. Is it possible to create new columns instead of changing the current ones? – Wicked Gummy Bear Jan 28 '23 at 17:09
  • This not replace columns but create a new dataframe (results) – Corralien Jan 28 '23 at 17:12
  • It doesn't seem to account for the current column also having to be positive, is that correct? – Wicked Gummy Bear Jan 28 '23 at 17:14
  • Also, just 1 more question. Lets say for example I have random columns in between. Like, the main columns are: main_column_1, main_column_2, and suddenly theres a side_column_2.. then it will also take that column into consideration? because you use a df.rolling function and you aren't looking specifically at the numbers at the end of the main_column_ ... 1 or 2...? And I was testing your example and there was a row: original: row_9=[0 1 1 1 1 0 0 0] result: row_9=[1 0 0 0 0 0 0 0] Anyway, thank you for the quick responses! – Wicked Gummy Bear Jan 28 '23 at 17:19
  • I think I can maybe make this work by sorting all my columns on name and built in some min/max N values to make sure that when I'm at nr 5, it wont go beyond nr8 etc. I will mark this as solved for now. If you manage to find an answer that works different I'll look at it. For now this might work if I restructure my dataframe. Thank you for the help! – Wicked Gummy Bear Jan 28 '23 at 20:40
  • I fixed the problem for the current column. Change `.shift(n)` to `.shift(n-1)`. – Corralien Jan 28 '23 at 20:41
  • @WickedGummyBear. How you columns are named. It's not a big deal to do it. Let me know. There is a specific pattern? – Corralien Jan 28 '23 at 20:41
  • I can write a small code to filter and reorder the columns to prepare the input for rolling – Corralien Jan 28 '23 at 20:43
  • I would apreciate it, but only if it doesn't take up too much of your time – Wicked Gummy Bear Jan 28 '23 at 20:44
  • Not a problem. Do you have a pattern? It could be easy :-) – Corralien Jan 28 '23 at 20:45
  • The dataframe code is too big to share here, but I found the answer here: https://stackoverflow.com/questions/11067027/sorting-columns-in-pandas-dataframe-based-on-column-name ... I will try and figure out if I can get the rest running from there. Right now I will have to restructure the dataframe build I have to get the logic working with this answer. I will probably just post a new question if I can't manage to get it working. My other logic is all based on pd.Series and conditions etc but pretty complicatedly build. Thank you for all your help! – Wicked Gummy Bear Jan 28 '23 at 20:53
  • 1
    OK. Keep in mind, you just have to create a variable `cols = [......]` and use `df[cols].rolling...`. That's all. – Corralien Jan 28 '23 at 20:54