I apologize in advance, as this question is partly technical and partly about better understanding pandas/Python design.

Here's what I'm trying to do. Given this data:

a,b
1,2
1,3
2,4
2,5

I want to create a third column, 'c', whose values are computed per group of column a, using the rows of that group. The way I have done it is:

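for item in df['a'].unique():
    # Slice out one group; .copy() makes it an independent frame
    dataSetToProcess = df.loc[df['a'] == item, ['a', 'b']].copy()
    # MyFunction is my own function; the result lands only on the copy, not on df
    dataSetToProcess['c'] = dataSetToProcess.apply(MyFunction)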

With this approach, I have a value for c, but only in dataSetToProcess, which is not part of the main df. I would like the value of c to be available in the main df.

My expected result (for simplicity, let's say each group of column a just averages its own b values and puts that average in column c):

a,b,c
1,2,2.5
1,3,2.5
2,4,4.5
2,5,4.5

My thought was to take each result and map it back to the original df using columns a and b, but I was wondering if there is an easier/cleaner approach?

Lostsoul
  • Can you edit your question and add a sample of `dataSetToProcess` and the expected result? – Andrej Kesely Oct 30 '20 at 12:33
  • @AndrejKesely Thank you. I have updated the question with a simplified expected result. In my for loop, `dataSetToProcess` is just created from the unique values in column a. I need this because my function needs the entire subgroup of data. – Lostsoul Oct 30 '20 at 12:56

1 Answer

Let's suppose you have two dataframes, df1 and dataSetToProcess:

import pandas as pd

df1 = pd.DataFrame({"a": [1, 1, 2, 2], "b": [2, 3, 4, 5]})
dataSetToProcess = pd.DataFrame({"a": [1, 2], "c": [2.5, 4.5]})

print(df1)
print(dataSetToProcess)

   a  b
0  1  2
1  1  3
2  2  4
3  2  5
   a    c
0  1  2.5
1  2  4.5
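
The second frame is typed in by hand here. As a minimal sketch, a frame of the same shape could probably be derived from df1 itself with a groupby, assuming the per-group value is simply the mean of b (the simplification used in the question). The name per_group is hypothetical and plays the role of dataSetToProcess:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 1, 2, 2], "b": [2, 3, 4, 5]})

# One row per distinct value of "a"; "c" holds the mean of "b" within that group.
# The mean is only a stand-in for whatever the real per-group function computes.
per_group = df1.groupby("a", as_index=False).agg(c=("b", "mean"))
print(per_group)
```

Keeping "a" as a regular column (as_index=False) means the result can be used as the right-hand side of a merge on "a".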

Then:

df1 = df1.merge(dataSetToProcess, on="a")
print(df1)

Prints:

   a  b    c
0  1  2  2.5
1  1  3  2.5
2  2  4  4.5
3  2  5  4.5

From there, you can compute your variable based on the a, b and c columns.
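
If the per-group result is a single value computed from one column, the merge can likely be skipped altogether with groupby plus transform, which hands each group to the function and broadcasts the result back onto that group's rows. This is a sketch under assumptions: my_function is a hypothetical stand-in for the asker's MyFunction and is assumed to take one group's b values and return a scalar.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [2, 3, 4, 5]})

def my_function(group_b: pd.Series) -> float:
    # Hypothetical stand-in for MyFunction: receives every "b" value of one group.
    return group_b.mean()

# transform calls my_function once per group of "a" and repeats the scalar
# result on each row of that group, aligned with df's original index.
df["c"] = df.groupby("a")["b"].transform(my_function)
print(df)
```

Because the result is aligned to the original index, the new column lands directly in the main frame and no loop or write-back step is needed. If the function needs several columns of the subgroup at once, the merge shown above remains a reasonable route.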

Andrej Kesely