I have a pandas DataFrame containing a number of batches (batch sizes, i.e. row counts, differ). Currently I create a new feature for each batch by hand, using that batch's own data.

I want to automate this process: first create a new column, then iterate over the batch id column and, for each group of rows sharing the same batch id, compute the new feature values and write them into the new column, then continue to the next batch.

Here is the code for the manual method, for a single batch:

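import numpy as np
from sklearn.neighbors import BallTree

# Select one batch; .copy() avoids a SettingWithCopyWarning when adding a column below
batch = samples.loc[samples['batch id'] == 'XX'].copy()

# Distances from each row of the batch to its 2 nearest red points
tree = BallTree(red_points[['col1','col2']], leaf_size=15, metric='minkowski')
distance, index = tree.query(batch[['col1','col2']], k=2)

batch_size = batch.shape[0]

# col3 (0 or 1) picks which of the two neighbour distances to keep for each row
batch['new feature'] = distance[np.arange(batch_size), batch['col3'].to_numpy()]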
  • I think [this question](https://stackoverflow.com/questions/24980437/pandas-groupby-and-then-merge-on-original-table/24980809) may do this nicely? – Josh Ziegler Jan 09 '21 at 01:00
  • Something like df.groupby('batch id').transform(func) where func takes batch and returns what you want to put into the new feature. – Josh Ziegler Jan 09 '21 at 01:02
  • Sorry, df.transform only acts series by series, but df.apply(func) may work. – Josh Ziegler Jan 09 '21 at 01:13

1 Answer

Since your batches are identified by the "batch id" column, you can iterate over all the unique batch ids and fill in the "new feature" column only for the rows of the batch currently being processed.

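import numpy as np

# First create an empty column
samples["new feature"] = np.nan

# Iterate through all unique batch ids
for batch_id in samples["batch id"].unique():
    mask = samples["batch id"] == batch_id   # rows belonging to this batch
    batch = samples.loc[mask]
    # do your computations on `batch`, producing one value per row
    new_values = ...  # placeholder: replace with your computed values
    samples.loc[mask, "new feature"] = new_values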
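As a minimal sketch of how the loop could look with the BallTree computation from the question filled in: the toy `red_points` and `samples` frames below are made-up stand-ins for the real data, and the sketch assumes that the same tree is used for every batch and that `col3` holds 0 or 1 (the neighbour to pick).

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Hypothetical toy data standing in for the question's `red_points` and `samples`.
rng = np.random.default_rng(0)
red_points = pd.DataFrame(rng.random((20, 2)), columns=["col1", "col2"])
samples = pd.DataFrame({
    "batch id": ["A"] * 3 + ["B"] * 5 + ["C"] * 2,  # batches of different sizes
    "col1": rng.random(10),
    "col2": rng.random(10),
    "col3": rng.integers(0, 2, size=10),            # 0 or 1: which neighbour to use
})

# Assumption: the tree does not depend on the batch, so it is built once.
tree = BallTree(red_points[["col1", "col2"]].to_numpy(), leaf_size=15, metric="minkowski")

# Empty column that is filled batch by batch.
samples["new feature"] = np.nan

for batch_id in samples["batch id"].unique():
    mask = samples["batch id"] == batch_id
    batch = samples.loc[mask]
    # Distances from each row of this batch to its 2 nearest red points.
    distance, index = tree.query(batch[["col1", "col2"]].to_numpy(), k=2)
    # Keep, per row, the distance selected by col3.
    picked = distance[np.arange(len(batch)), batch["col3"].to_numpy()]
    samples.loc[mask, "new feature"] = picked
```

Writing through `samples.loc[mask, ...]` assigns into the original frame rather than into a copy of the batch, so no SettingWithCopyWarning should appear.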
Sanketh B. K