Setting cell in pandas dataframe with list of strings

Question

I have a custom dataframe B with the following properties (obtained via B.dtypes):

[236 rows x 10 columns]
Filename               object
Path                   object
Filenumber              int64
Channel                object
Folder                 object
User-defined labels    object
Ranking                 int64
Lower limit             int64
Upper limit             int64
Enabled                  bool
dtype: object

As example, calling

print(
    B.loc[
        B["User-defined labels"].explode().eq("Label 1").groupby(level=0).any()
        & (B["Filenumber"] == 98)
    ]
)

gives

        Filename                                               Path  Filenumber      Channel   Folder User-defined labels  Ranking  Lower limit  Upper limit  Enabled
0    File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_0  Label 1           [Label 1]        0            0       999999     True
39   File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_1  Label 1           [Label 1]        0            0       999999     True
78   File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_2  Label 1           [Label 1]        0            0       999999     True
117  File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_3  Label 1           [Label 1]        0            0       999999     True

Now, I would like to replace/add new labels to the column ["User-defined labels"] by setting them via a list of labels. The list of labels is defined as

new_labels = ["Label 1", "Label 3"]

I accessed the column via

B.loc[
    B["User-defined labels"].explode().eq("Label 1").groupby(level=0).any()
    & (B["Filenumber"] == 98),
    ["User-defined labels"],
]

Initially, I am starting from an output of

        Filename                                               Path  Filenumber      Channel   Folder User-defined labels  Ranking  Lower limit  Upper limit  Enabled
0    File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_0  Label 1           [Label 1]        0            0       999999     True
39   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_1  Label 1           [Label 1]        0            0       999999     True
78   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_2  Label 1           [Label 1]        0            0       999999     True
117  File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_3  Label 1           [Label 1]        0            0       999999     True

and would like to obtain

        Filename                                               Path  Filenumber      Channel   Folder User-defined labels  Ranking  Lower limit  Upper limit  Enabled
0    File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_0  Label 1           [Label 1, Label 3]        0            0       999999     True
39   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_1  Label 1           [Label 1, Label 3]        0            0       999999     True
78   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_2  Label 1           [Label 1, Label 3]        0            0       999999     True
117  File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_3  Label 1           [Label 1, Label 3]        0            0       999999     True

For that, I tried the following operations:

B.loc[] = new_labels
Fails with Must have equal len keys and value when setting with an iterable
B.loc[] = [new_labels]
Fails with Must have equal len keys and value when setting with an ndarray
B.loc[] = pd.Series(new_labels, dtype=object)
Sets the first entry in the column User-defined labels to Label 1 (from [Label 1]) and the entries in all further columns to NaN
B.loc[] = pd.Series([new_labels], dtype=object) Sets the first entry in the column User-defined labels to [Label 1, Label 3] (as intended), but all further columns are set to NaN

I repeated the same approach with at() instead of loc[], as described here: https://stackoverflow.com/a/70968810/2546099, resulting in:

B.at[] = new_labels
Fails with Must have equal len keys and value when setting with an iterable
B.at[] = [new_labels]
Fails with Must have equal len keys and value when setting with an ndarray
B.at[] = pd.Series(new_labels, dtype=object)
Fails with unhashable type: 'list'
B.at[] = pd.Series([new_labels], dtype=object) Fails with unhashable type: 'list'

Are there other approaches which can solve my issues? Of course, I could just add new columns when adding new labels, but that would raise additional issues (based on my current knowledge):

Each time I add new labels for some rows, I would have to update the entire dataframe to new columns
If I remove labels, I would have to keep empty columns. For cleanup, I would have to check if a column is completely empty before deleting it
Iterating over labels is faster if I keep them in a list in one column compared to iterating over all columns which contain the name "label"

Or are those non-issues?

Do you want to change the value of label to `new_labels`? Why don't you use `B["User-defined labels"] = anything`. — Amir Pourmand, Aug 05 '22 at 09:44
Wouldn't that change the entire column? I only want to change some specific rows, not the entire column — arc_lupus, Aug 05 '22 at 09:45
That's right. You can still use `df.loc[row_indexer,column_indexer]`. Take a look at [here](https://pandas.pydata.org/docs/user_guide/indexing.html) if this is what you meant. — Amir Pourmand, Aug 05 '22 at 09:47
Hmm, I tried that (first test), and it failed for me, as stated above. Or did I misunderstand something? — arc_lupus, Aug 05 '22 at 09:55
I tested with numpy and understood your problem ... It seems stupid that pandas gives error :( — Amir Pourmand, Aug 05 '22 at 10:06

score 2 · Answer 1 · answered Aug 05 '22 at 11:57

I think you can achieve your goal with apply:

import pandas as pd

df = pd.DataFrame({'a':
    [
        ['label 1'],
        ['label 1'],
        ['label 1'],
        ['label 1', 'label 2'],
        ['label 2', 'label 3'],
        ['label 1'],
        ['label 1'],
        ['label 1'],
        ['label 1'],
        ['label 1'],
    ],
    'b': [
        1,
        2,
        3,
        1,
        1,
        2,
        3,
        1,
        2,
        3,
    ]
})

print(df)
                    a  b
0           [label 1]  1
1           [label 1]  2
2           [label 1]  3
3  [label 1, label 2]  1
4  [label 2, label 3]  1
5           [label 1]  2
6           [label 1]  3
7           [label 1]  1
8           [label 1]  2
9           [label 1]  3

iterating through the rows with lambda will set a new value in every row with an index that exists in the mask

mask = df['a'].loc[df['a'].explode().eq("label 1").groupby(level=0).any() & (df['b'] == 1)]
df['a'] = df.apply(lambda x: ['label 1', 'label 3'] if x.name in mask.index else x['a'], axis=1)

print(df)

                    a  b
0  [label 1, label 3]  1
1           [label 1]  2
2           [label 1]  3
3  [label 1, label 3]  1
4  [label 2, label 3]  1
5           [label 1]  2
6           [label 1]  3
7  [label 1, label 3]  1
8           [label 1]  2
9           [label 1]  3

Setting cell in pandas dataframe with list of strings

1 Answers1