Pandas - vectorizing convoluted update

Question

I have a pandas dataframe with 3 columns: path, tags and column1, where tags is a list of strings and column1 holds boolean values. I want to group the rows by path, using a regular expression that depends in part on each row's path value. Namely, each row has a unique path, and I want to group it with the other rows whose paths are similar. Then, if the group as a whole meets some condition (in this example, a check on whether the group has more than 5 items), the columns of all rows in that group are modified.

import re
import pandas as pd

def has_to_change(df):
  # True when the group has more than 5 rows, False otherwise
  return len(df) > 5

def add_tag_to_tags(row):
  if 'tag' not in row['tags']:
      row['tags'].append('tag')
  return row

if __name__ == '__main__':

  pattern =  r'some regex'
  regex =  re.compile(pattern)

  df = pd.read_csv(df_path)

  for index, row in df.iterrows():
      file_name = row['path']
      matches = regex.search(file_name)
      org_path = matches.group('some regex group') #get a match from this row's path

      matching_rows = df[df['path'].str.contains(re.escape(org_path) + r'(?:\.xml|\.txt)')] # find all rows that contain this file name with a different extension, say xml or txt

      if has_to_change(matching_rows): # if the condition is met, change the values and save them back to the dataframe
          # I keep the loop here because I want to overwrite the row with the same index (it was originally a bit more complex)
          for inner_index, augmented_row in matching_rows.iterrows():
              augmented_row['column1'] = True
              augmented_row = add_tag_to_tags(augmented_row) # the helper takes a single row, so call it directly
              df.loc[inner_index] = augmented_row # iterrows() yields index labels, so write back with .loc

Can this code be vectorized? It's very slow, but I can't find a way to:

create groups from a regular expression
check a condition on each such group as a whole
and only then update the rows of those groups
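
For reference, a minimal sketch of these three steps without iterrows(). It assumes the group key can be captured from each path by a single regex, and that the condition is the row count from has_to_change. The names tag_large_groups, STEM_PATTERN and the "stem" group are hypothetical, and the pattern is only a stand-in for the real one, which is not shown in the question.

```python
import pandas as pd

# Hypothetical pattern: the named group "stem" captures everything before
# the extension. Replace it with the real regex and group name.
STEM_PATTERN = r'(?P<stem>.+)\.(?:jpg|xml|txt)$'

def tag_large_groups(df, pattern=STEM_PATTERN, threshold=5, tag='tag'):
    """Return a copy where every group with more than `threshold` rows is updated."""
    out = df.copy()
    # 1. create groups by regular expression: one key per row
    key = out['path'].str.extract(pattern, expand=False)
    # 2. check each group as a whole: broadcast the group size back to its rows
    group_size = out.groupby(key)['path'].transform('size')
    mask = group_size > threshold  # same test as has_to_change
    # 3. only then update the rows of the qualifying groups
    out.loc[mask, 'column1'] = True
    new_tags = [
        tags + [tag] if selected and tag not in tags else tags
        for tags, selected in zip(out['tags'], mask)
    ]
    out['tags'] = pd.Series(new_tags, index=out.index)
    return out
```

The sketch assumes every path matches the pattern and that tags already holds Python lists; after pd.read_csv the column would contain strings and would need parsing first. It builds new lists instead of appending in place, so the original dataframe is left untouched.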

Example data

      path,                              tags,             column1
/mnt/000000386703_aug_13237_0.jpg,       ['tag1'],         False
/mnt/000000386703_aug_13237_0.xml,       ['tag1'],         False
/mnt/000000386703_aug_13237_0.txt,       ['tag1', 'tag1'], False
/mnt/train_image_png_1221_aug_1245_5.jpg,['tag1'],         False
/mnt/000000306488_aug_9203_1.jpg,        ['tag1'],         False
/mnt/000000391768_aug_20250_1.jpg,       ['tag1'],         False
/mnt/1561887652.9493463_aug_1462_0.jpg,  ['tag1'],         True

After update:

      path,                              tags,                   column1
/mnt/000000386703_aug_13237_0.jpg,       ['tag1','tag'],         True
/mnt/000000386703_aug_13237_0.xml,       ['tag1','tag'],         True
/mnt/000000386703_aug_13237_0.txt,       ['tag1','tag1', 'tag'], True
/mnt/train_image_png_1221_aug_1245_5.jpg,['tag1'],               False
/mnt/000000306488_aug_9203_1.jpg,        ['tag1'],               False
/mnt/000000391768_aug_20250_1.jpg,       ['tag1'],               False
/mnt/1561887652.9493463_aug_1462_0.jpg,  ['tag1'],               True

The first 3 rows have the value 'tag' added to the tags column and column1 set to True, because they share a path value captured by the regular expression that finds all rows with a similar path (so the grouping depends on each row's path).
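
The grouping described here can be checked on the example paths. This is only a sketch: it assumes that "similar path" means "same path once the final extension is dropped", which is a guess at what the real regex group captures.

```python
import pandas as pd

paths = pd.Series([
    '/mnt/000000386703_aug_13237_0.jpg',
    '/mnt/000000386703_aug_13237_0.xml',
    '/mnt/000000386703_aug_13237_0.txt',
    '/mnt/train_image_png_1221_aug_1245_5.jpg',
    '/mnt/000000306488_aug_9203_1.jpg',
    '/mnt/000000391768_aug_20250_1.jpg',
    '/mnt/1561887652.9493463_aug_1462_0.jpg',
])

# Assumed grouping rule: strip the final extension to get the shared stem.
stem = paths.str.replace(r'\.[^./]+$', '', regex=True)

# Number of rows that share each row's stem.
rows_in_group = stem.map(stem.value_counts())

print(rows_in_group.tolist())  # [3, 3, 3, 1, 1, 1, 1]
```

Only the first three paths share a stem. Note that with the `> 5` test in has_to_change none of these groups would qualify, so the expected output above implies a smaller threshold for this sample.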

Your question is quite general so I will point you to a [general answer](https://stackoverflow.com/a/55557758/4909087) that should hopefully help you find the right functions to replace the call to iterrows(). — cs95, Jul 14 '20 at 09:52
this general answer does not answer my question of how to group rows by regular expression that depends on row's value. — Bartek Wójcik, Jul 14 '20 at 10:00

0 Answers