I have a DataFrame B with 236 rows and 10 columns and the following dtypes (obtained via B.dtypes):
Filename object
Path object
Filenumber int64
Channel object
Folder object
User-defined labels object
Ranking int64
Lower limit int64
Upper limit int64
Enabled bool
dtype: object
As an example, calling
print(
B.loc[
B["User-defined labels"].explode().eq("Label 1").groupby(level=0).any()
& (B["Filenumber"] == 98)
]
)
gives
Filename Path Filenumber Channel Folder User-defined labels Ranking Lower limit Upper limit Enabled
0 File_98.csv C:\Users\test_user\Documents\Training_data\Label 1\... 98 subfolder_0 Label 1 [Label 1] 0 0 999999 True
39 File_98.csv C:\Users\test_user\Documents\Training_data\Label 1\... 98 subfolder_1 Label 1 [Label 1] 0 0 999999 True
78 File_98.csv C:\Users\test_user\Documents\Training_data\Label 1\... 98 subfolder_2 Label 1 [Label 1] 0 0 999999 True
117 File_98.csv C:\Users\test_user\Documents\Training_data\Label 1\... 98 subfolder_3 Label 1 [Label 1] 0 0 999999 True
Now, I would like to replace or add labels in the column "User-defined labels"
by assigning a list of labels. The list is defined as
new_labels = ["Label 1", "Label 3"]
I accessed the column via
B.loc[
B["User-defined labels"].explode().eq("Label 1").groupby(level=0).any()
& (B["Filenumber"] == 98),
["User-defined labels"],
]
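For reference, the selection behaves the same way on a small toy frame containing only the two columns it touches (a hypothetical stand-in, since the full 236-row frame is not reproducible here):

```python
import pandas as pd

# Hypothetical toy stand-in for B: only the columns the selection uses
B = pd.DataFrame(
    {
        "Filenumber": [98, 97, 98],
        "User-defined labels": [["Label 1"], ["Label 2"], ["Label 1"]],
    }
)

# explode() repeats the row index once per list element, so
# groupby(level=0) collapses the per-element comparison back
# to one boolean per original row
mask = (
    B["User-defined labels"].explode().eq("Label 1").groupby(level=0).any()
    & (B["Filenumber"] == 98)
)
print(B.loc[mask, ["User-defined labels"]])  # rows 0 and 2
```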
Initially, this selection yields
Filename Path Filenumber Channel Folder User-defined labels Ranking Lower limit Upper limit Enabled
0 File_98.csv C:\Users\riro\Documents\Training_data\Label 1\... 98 subfolder_0 Label 1 [Label 1] 0 0 999999 True
39 File_98.csv C:\Users\riro\Documents\Training_data\Label 1\... 98 subfolder_1 Label 1 [Label 1] 0 0 999999 True
78 File_98.csv C:\Users\riro\Documents\Training_data\Label 1\... 98 subfolder_2 Label 1 [Label 1] 0 0 999999 True
117 File_98.csv C:\Users\riro\Documents\Training_data\Label 1\... 98 subfolder_3 Label 1 [Label 1] 0 0 999999 True
and would like to obtain
Filename Path Filenumber Channel Folder User-defined labels Ranking Lower limit Upper limit Enabled
0 File_98.csv C:\Users\riro\Documents\Training_data\Label 1\... 98 subfolder_0 Label 1 [Label 1, Label 3] 0 0 999999 True
39 File_98.csv C:\Users\riro\Documents\Training_data\Label 1\... 98 subfolder_1 Label 1 [Label 1, Label 3] 0 0 999999 True
78 File_98.csv C:\Users\riro\Documents\Training_data\Label 1\... 98 subfolder_2 Label 1 [Label 1, Label 3] 0 0 999999 True
117 File_98.csv C:\Users\riro\Documents\Training_data\Label 1\... 98 subfolder_3 Label 1 [Label 1, Label 3] 0 0 999999 True
For that, I tried the following assignments, where loc[] stands for the full selection shown above:
B.loc[] = new_labels
fails with "Must have equal len keys and value when setting with an iterable".
B.loc[] = [new_labels]
fails with "Must have equal len keys and value when setting with an ndarray".
B.loc[] = pd.Series(new_labels, dtype=object)
sets the first selected entry in the column "User-defined labels" to "Label 1"
(instead of ["Label 1"]) and all further selected rows to NaN (the Series aligns on its default index 0, 1, so only row 0 matches).
B.loc[] = pd.Series([new_labels], dtype=object)
sets the first selected entry in the column "User-defined labels" to
["Label 1", "Label 3"]
(as intended), but all further selected rows are set to NaN.
I repeated the same approach with at()
instead of loc[]
, as described here: https://stackoverflow.com/a/70968810/2546099, resulting in:
B.at[] = new_labels
fails with "Must have equal len keys and value when setting with an iterable".
B.at[] = [new_labels]
fails with "Must have equal len keys and value when setting with an ndarray".
B.at[] = pd.Series(new_labels, dtype=object)
fails with "unhashable type: 'list'".
B.at[] = pd.Series([new_labels], dtype=object)
fails with "unhashable type: 'list'".
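One fallback that does seem to work is looping over the matching row labels and setting each cell with scalar at() labels, since at() accepts a list as a single cell value when both labels are scalars. A minimal sketch on a hypothetical toy frame (not the real B):

```python
import pandas as pd

# Hypothetical toy frame mirroring the relevant columns of B
B = pd.DataFrame(
    {
        "Filenumber": [98, 97, 98],
        "User-defined labels": [["Label 1"], ["Label 2"], ["Label 1"]],
    }
)
new_labels = ["Label 1", "Label 3"]

# Same row selection as above
mask = (
    B["User-defined labels"].explode().eq("Label 1").groupby(level=0).any()
    & (B["Filenumber"] == 98)
)

# Set each matching cell individually; at() with two scalar labels
# allows a list as the cell value
for idx in B.index[mask]:
    B.at[idx, "User-defined labels"] = list(new_labels)  # copy per row
```

The list() call gives each row its own copy, so the rows do not share one mutable list object.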
Are there other approaches that can solve this? Of course, I could simply add a new column per label, but based on my current knowledge that would raise additional issues:
- Each time I add new labels for some rows, I would have to extend the entire dataframe with new columns
- If I remove labels, I would be left with empty columns; for cleanup, I would have to check whether a column is completely empty before deleting it
- Iterating over labels kept as a list in one column is faster than iterating over all columns whose name contains "label"
Or are those non-issues?