I have a pandas dataframe that contains 3 columns: path, tags and column1, where tags is a list of strings and column1 has boolean values.
Now I want to group the rows by path using a regular expression that depends in part on each row's path value. Namely, each row has a unique path, and I want to group it with the other files that have a similar path.
Then, if all rows in this group meet some condition (in this example, there can't be more than 5 such items), all rows' columns are modified.
    import re

    import pandas as pd


    def has_to_change(df):
        # True when the group has more than 5 rows
        return len(df) > 5


    def add_tag_to_tags(row):
        if 'tag' not in row['tags']:
            row['tags'].append('tag')
        return row


    if __name__ == '__main__':
        pattern = r'some regex'
        regex = re.compile(pattern)
        df = pd.read_csv(df_path)
        for index, row in df.iterrows():
            file_name = row['path']
            matches = regex.search(file_name)
            org_path = matches.group('some regex group')  # get a match from this row's path
            # find all rows that contain this file name but with a different extension, say, xml or txt
            matching_rows = df[df['path'].str.contains(re.escape(org_path) + r'(?:\.xml|\.txt)')]
            if has_to_change(matching_rows):  # if the condition is met, change the values and save back to the dataframe
                # I keep a loop here because I want to overwrite the row with the same index (it was originally a bit more complex)
                for inner_index, augmented_row in matching_rows.iterrows():
                    augmented_row['column1'] = True
                    augmented_row = add_tag_to_tags(augmented_row)
                    df.loc[inner_index] = augmented_row
Can this code be vectorized somehow? It is very slow, but I can't find a way to:
- create groups by regular expression
- check value for each such group as a whole
- and only then update these groups
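The three steps above can be sketched without a row loop, under some assumptions: the group key is whatever the first capture group of a pattern extracts from path, the tags column already holds real Python lists (values read with pd.read_csv arrive as strings and would need parsing first), and the condition is a minimum group size. The name tag_large_groups, its parameters, and the extension-stripping pattern are hypothetical and chosen only for illustration; min_size=3 is picked so that the example data behaves as described, not taken from the original code.

```python
import pandas as pd


def tag_large_groups(df, key_pattern, min_size, tag="tag"):
    """Return a copy of df where rows in groups of at least min_size are updated."""
    out = df.copy()
    # Step 1: one group key per row, taken from the first capture group.
    key = out["path"].str.extract(key_pattern, expand=False)
    # Step 2: size of each row's group; rows whose path did not match get NaN.
    group_size = key.map(key.value_counts())
    mask = key.notna() & (group_size >= min_size)
    # Step 3: update only the rows of qualifying groups.
    out.loc[mask, "column1"] = True
    # Lists are Python objects, so this part is a plain comprehension.
    out["tags"] = [
        tags + [tag] if selected and tag not in tags else tags
        for tags, selected in zip(out["tags"], mask)
    ]
    return out


# Example data from the question.
df = pd.DataFrame({
    "path": [
        "/mnt/000000386703_aug_13237_0.jpg",
        "/mnt/000000386703_aug_13237_0.xml",
        "/mnt/000000386703_aug_13237_0.txt",
        "/mnt/train_image_png_1221_aug_1245_5.jpg",
        "/mnt/000000306488_aug_9203_1.jpg",
        "/mnt/000000391768_aug_20250_1.jpg",
        "/mnt/1561887652.9493463_aug_1462_0.jpg",
    ],
    "tags": [["tag1"], ["tag1"], ["tag1", "tag1"], ["tag1"], ["tag1"], ["tag1"], ["tag1"]],
    "column1": [False, False, False, False, False, False, True],
})

# Assumed key: the path with its final extension removed.
result = tag_large_groups(df, r"^(.*)\.[^./]+$", min_size=3)
print(result)
```

The regular expression runs once per row through str.extract instead of once per row against the whole frame, so the work grows roughly linearly with the number of rows rather than quadratically.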
Example data
path, tags, column1
/mnt/000000386703_aug_13237_0.jpg, ['tag1'], False
/mnt/000000386703_aug_13237_0.xml, ['tag1'], False
/mnt/000000386703_aug_13237_0.txt, ['tag1', 'tag1'], False
/mnt/train_image_png_1221_aug_1245_5.jpg,['tag1'], False
/mnt/000000306488_aug_9203_1.jpg, ['tag1'], False
/mnt/000000391768_aug_20250_1.jpg, ['tag1'], False
/mnt/1561887652.9493463_aug_1462_0.jpg, ['tag1'], True
After update:
path, tags, column1
/mnt/000000386703_aug_13237_0.jpg, ['tag1','tag'], True
/mnt/000000386703_aug_13237_0.xml, ['tag1','tag'], True
/mnt/000000386703_aug_13237_0.txt, ['tag1','tag1', 'tag'], True
/mnt/train_image_png_1221_aug_1245_5.jpg,['tag1'], False
/mnt/000000306488_aug_9203_1.jpg, ['tag1'], False
/mnt/000000391768_aug_20250_1.jpg, ['tag1'], False
/mnt/1561887652.9493463_aug_1462_0.jpg, ['tag1'], True
The first 3 rows have the 'tag' value added to the tags column, and their column1 value is changed to True, because they share a path value that is caught by the regular expression. The expression finds all rows that have a similar path, so it depends on each row's path.
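To make the grouping visible, here is a small sketch that prints the key each example path would share. It assumes the key is the path with its final extension removed; that pattern is a stand-in for illustration, since the real regular expression is not shown here.

```python
import pandas as pd

paths = pd.Series([
    "/mnt/000000386703_aug_13237_0.jpg",
    "/mnt/000000386703_aug_13237_0.xml",
    "/mnt/000000386703_aug_13237_0.txt",
    "/mnt/train_image_png_1221_aug_1245_5.jpg",
    "/mnt/000000306488_aug_9203_1.jpg",
    "/mnt/000000391768_aug_20250_1.jpg",
    "/mnt/1561887652.9493463_aug_1462_0.jpg",
])

# Hypothetical key: everything before the last extension.
key = paths.str.extract(r"^(.*)\.[^./]+$", expand=False)
# Number of rows that share each row's key.
group_size = key.map(key.value_counts())

print(pd.DataFrame({"path": paths, "key": key, "group_size": group_size}))
```

Under this assumed key, only the first three paths fall into the same group; every other path is alone in its group.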