2

I have a DataFrame like this:

         A   B
0  name1_X   2
1  name2_X   2
2  name3_X   2
3  name1_Y NaN
4  name2_Y NaN
5  name3_Y NaN

where column A is the name with a suffix _X or _Y, and column B is a value.

I want to make rows with _Y equal to

`-1 * corresponding `_X`

with the same name.

The output should be,

         A   B
0  name1_X   2
1  name2_X   2
2  name3_X   2
3  name1_Y  -2
4  name2_Y  -2
5  name3_Y  -2

Sometimes, the DataFrame will be

         A   B
0  name1_X   2
1  name1_Y NaN
2  name2_Y NaN
3  name3_Y NaN

and the output should be:

         A   B
0  name1_X   2
1  name1_Y  -2
2  name2_Y NaN
3  name3_Y NaN

Non-overlapping part remains NaN

How can I solve this problem in a simple way?

LondonRob
  • 73,083
  • 37
  • 144
  • 201
Simon Beckham
  • 67
  • 1
  • 4
  • Well, what have you tried? – Alexander Dec 28 '15 at 22:51
  • Much of this problem will be solved with `str.split` as shown [here](http://stackoverflow.com/a/29370709/2071807) – LondonRob Dec 28 '15 at 23:30
  • At first, I want to add one column "C", which deletes the suffix of column "A". It shows like `["name1","name2","name3","name1","name2","name3"]`. Then I `groupby` column "C" , and use `.loc` to set the name(i)_Y equal to -1*name(i)_X. I think it can solve the problem, but maybe not the best solution. – Simon Beckham Dec 29 '15 at 03:20

3 Answers3

2

Separate the 'X' values from the dataframe and create corresponding 'Y' values. Then just concatenate.

I've extended the example dataframe to consider cases where the names do not match. In this case, if the name ends with '_X', a new '_Y' name variable will be created. If there is a name '_Y' but no corresponding name '_X', then it will be left unchanged.

df = pd.DataFrame({'A': ['name0_X', 'name2_X', 'name3_X', 'name1_Y', 'name2_Y', 'name4_Y'],
                   'B': [2.0, 2.0, 2.0, None, None, None]})
>>> df
         A   B
0  name0_X   2  # Create new `name0_Y`
1  name2_X   2  # Match. Multiply value by minus one.
2  name3_X   2  # Create new `name3_Y`
3  name1_Y NaN  # No match. Leave as is.
4  name2_Y NaN  # Update with matching X after multiplying by minus one.
5  name4_Y NaN  # No match. Leave as is.

dfx = df[df.A.str[-2:] == '_X']
dfy = dfx.copy()
dfy.B *= -1
dfy.A = dfy.A.str[:-2] + '_Y'
y_names = df[df.A.str[-2:] == '_Y'].A
missing_Y_names = y_names[~y_names.isin(dfy.A)]
df_missing_y = df.loc[df.A.isin(missing_Y_names), :]
result = pd.concat([dfx, dfy, df_missing_y], ignore_index=True)

>>> result
         A   B
0  name0_X   2
1  name2_X   2
2  name3_X   2
3  name0_Y  -2
4  name2_Y  -2
5  name3_Y  -2
6  name1_Y NaN
7  name4_Y NaN
Alexander
  • 105,104
  • 32
  • 201
  • 196
0

The first thing to do is to separate the name part and the X part by splitting on the _:

In [12]: df[['name', 'xy']] = df.A.str.split('_', return_type='frame')

In [13]: df
Out[13]: 
         A   B   name xy
0  name1_X   2  name1  X
1  name2_X   2  name2  X
2  name3_X   2  name3  X
3  name1_Y NaN  name1  Y
4  name2_Y NaN  name2  Y
5  name3_Y NaN  name3  Y

You can now perform operations using the information which was previously locked away in that one field name1_X.

For example:

In [16]: df.set_index(['name', 'xy']).B.unstack('xy')
Out[16]: 
xy     X   Y
name        
name1  2 NaN
name2  2 NaN
name3  2 NaN

Now it's easy to set Y to be related to X in any way you wish:

In [17]: df2 = df.set_index(['name', 'xy']).B.unstack('xy')

In [18]: df2.Y = df2.X * -1

In [19]: df2
Out[19]: 
xy     X  Y
name       
name1  2 -2
name2  2 -2
name3  2 -2
LondonRob
  • 73,083
  • 37
  • 144
  • 201
  • This was my original approach, but you run into an issue where, for example, `name4_Y = NaN`. Once you unstack, column 'X' will also be NaN. I believe the end result should have `name4_Y = NaN` but should not have `name4_X = NaN`. – Alexander Dec 29 '15 at 00:05
  • So what should `name4_X` *be* in your scenario? If you mean it shouldn't exist at all you could always drop the rows where `X` is `Nan`. – LondonRob Dec 29 '15 at 00:06
  • I don't believe it should be in the result, but I leave it to the OP to clarify. I originally thought a missing Y name should also be excluded, but the question says to include them. – Alexander Dec 29 '15 at 00:08
  • Sorry! I didn't check carefully: I assumed you *were* the OP! – LondonRob Dec 29 '15 at 00:08
-1

My suggestion is merge the X data to Y data on axis 1, calculate and rebuild. This worked for me:

import pandas as pd
import numpy as np

df = pd.DataFrame()
df['A'] =   ['name1_X','name2_X','name3_X','name1_Y','name2_Y','name3_Y','name4_Y','name5_Y']
df['B'] = [2,2,2,np.nan,np.nan,np.nan,np.nan,np.nan]

#name column
df['name'] = [x.split('_')[0] for x in df.A]

#suffix column
df['suf'] =  [x.split('_')[1] for x in df.A]

#create y data and add same name x data
dfy = pd.merge(df[df.suf=='Y'],df[df.suf=='X'],on='name',how='left')

#preform calculation
dfy.B_x = [x*-1 for x in dfy.B_y]

#build output
output = pd.DataFrame(np.vstack([df[df.suf=='X'].values[:,:2],dfy.values[:,:2]]),columns=['A','B'])
Ezer K
  • 3,637
  • 3
  • 18
  • 34