1

I have a custom dataframe B with the following properties (obtained via B.dtypes):

[236 rows x 10 columns]
Filename               object
Path                   object
Filenumber              int64
Channel                object
Folder                 object
User-defined labels    object
Ranking                 int64
Lower limit             int64
Upper limit             int64
Enabled                  bool
dtype: object

As example, calling

print(
    B.loc[
        B["User-defined labels"].explode().eq("Label 1").groupby(level=0).any()
        & (B["Filenumber"] == 98)
    ]
)

gives

        Filename                                               Path  Filenumber      Channel   Folder User-defined labels  Ranking  Lower limit  Upper limit  Enabled
0    File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_0  Label 1           [Label 1]        0            0       999999     True
39   File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_1  Label 1           [Label 1]        0            0       999999     True
78   File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_2  Label 1           [Label 1]        0            0       999999     True
117  File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_3  Label 1           [Label 1]        0            0       999999     True

Now, I would like to replace/add new labels to the column ["User-defined labels"] by setting them via a list of labels. The list of labels is defined as

new_labels = ["Label 1", "Label 3"]

I accessed the column via

B.loc[
    B["User-defined labels"].explode().eq("Label 1").groupby(level=0).any()
    & (B["Filenumber"] == 98),
    ["User-defined labels"],
]

Initially, I am starting from an output of

        Filename                                               Path  Filenumber      Channel   Folder User-defined labels  Ranking  Lower limit  Upper limit  Enabled
0    File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_0  Label 1           [Label 1]        0            0       999999     True
39   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_1  Label 1           [Label 1]        0            0       999999     True
78   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_2  Label 1           [Label 1]        0            0       999999     True
117  File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_3  Label 1           [Label 1]        0            0       999999     True

and would like to obtain

        Filename                                               Path  Filenumber      Channel   Folder User-defined labels  Ranking  Lower limit  Upper limit  Enabled
0    File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_0  Label 1           [Label 1, Label 3]        0            0       999999     True
39   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_1  Label 1           [Label 1, Label 3]        0            0       999999     True
78   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_2  Label 1           [Label 1, Label 3]        0            0       999999     True
117  File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_3  Label 1           [Label 1, Label 3]        0            0       999999     True

For that, I tried the following operations:

  • B.loc[] = new_labels
    Fails with Must have equal len keys and value when setting with an iterable

  • B.loc[] = [new_labels]
    Fails with Must have equal len keys and value when setting with an ndarray

  • B.loc[] = pd.Series(new_labels, dtype=object)
    Sets the first entry in the column User-defined labels to Label 1 (from [Label 1]) and the entries in all further columns to NaN

  • B.loc[] = pd.Series([new_labels], dtype=object) Sets the first entry in the column User-defined labels to [Label 1, Label 3] (as intended), but all further columns are set to NaN

I repeated the same approach with at() instead of loc[], as described here: https://stackoverflow.com/a/70968810/2546099, resulting in:

  • B.at[] = new_labels
    Fails with Must have equal len keys and value when setting with an iterable

  • B.at[] = [new_labels]
    Fails with Must have equal len keys and value when setting with an ndarray

  • B.at[] = pd.Series(new_labels, dtype=object)
    Fails with unhashable type: 'list'

  • B.at[] = pd.Series([new_labels], dtype=object) Fails with unhashable type: 'list'

Are there other approaches which can solve my issues? Of course, I could just add new columns when adding new labels, but that would raise additional issues (based on my current knowledge):

  • Each time I add new labels for some rows, I would have to update the entire dataframe to new columns
  • If I remove labels, I would have to keep empty columns. For cleanup, I would have to check if a column is completely empty before deleting it
  • Iterating over labels is faster if I keep them in a list in one column compared to iterating over all columns which contain the name "label"

Or are those non-issues?

arc_lupus
  • 3,942
  • 5
  • 45
  • 81
  • Do you want to change the value of label to `new_labels`? Why don't you use `B["User-defined labels"] = anything`. – Amir Pourmand Aug 05 '22 at 09:44
  • Wouldn't that change the entire column? I only want to change some specific rows, not the entire column – arc_lupus Aug 05 '22 at 09:45
  • That's right. You can still use `df.loc[row_indexer,column_indexer]`. Take a look at [here](https://pandas.pydata.org/docs/user_guide/indexing.html) if this is what you meant. – Amir Pourmand Aug 05 '22 at 09:47
  • Hmm, I tried that (first test), and it failed for me, as stated above. Or did I misunderstand something? – arc_lupus Aug 05 '22 at 09:55
  • I tested with numpy and understood your problem ... It seems stupid that pandas gives error :( – Amir Pourmand Aug 05 '22 at 10:06

1 Answers1

2

I think you can achieve your goal with apply:

import pandas as pd

df = pd.DataFrame({'a':
    [
        ['label 1'],
        ['label 1'],
        ['label 1'],
        ['label 1', 'label 2'],
        ['label 2', 'label 3'],
        ['label 1'],
        ['label 1'],
        ['label 1'],
        ['label 1'],
        ['label 1'],
    ],
    'b': [
        1,
        2,
        3,
        1,
        1,
        2,
        3,
        1,
        2,
        3,
    ]
})

print(df)
                    a  b
0           [label 1]  1
1           [label 1]  2
2           [label 1]  3
3  [label 1, label 2]  1
4  [label 2, label 3]  1
5           [label 1]  2
6           [label 1]  3
7           [label 1]  1
8           [label 1]  2
9           [label 1]  3

iterating through the rows with lambda will set a new value in every row with an index that exists in the mask

mask = df['a'].loc[df['a'].explode().eq("label 1").groupby(level=0).any() & (df['b'] == 1)]
df['a'] = df.apply(lambda x: ['label 1', 'label 3'] if x.name in mask.index else x['a'], axis=1)

print(df)

                    a  b
0  [label 1, label 3]  1
1           [label 1]  2
2           [label 1]  3
3  [label 1, label 3]  1
4  [label 2, label 3]  1
5           [label 1]  2
6           [label 1]  3
7  [label 1, label 3]  1
8           [label 1]  2
9           [label 1]  3
Ze'ev Ben-Tsvi
  • 1,174
  • 1
  • 3
  • 7