
I have data like this example:

import numpy as np
import pandas as pd

df11 = pd.DataFrame({'code': [33000000, 33230000, 33235600, 33235678, 17000000, 17980000],
                     'Name': ['txt1', 'txt2', 'txt3', 'txt4', 'txt5', 'txt6'],
                     'level': [1, 2, 3, 4, 3, 4]})
print(df11)

My aim is to iterate over the rows (about 100,000 in reality) and create a new feature that combines the names, but ONLY for rows where level == 4. The output should eventually look like this:

code       combined_names
33235678   txt1+txt2+txt3+txt4
17980000   txt5+txt6

The 8-digit codes are always tied to the levels: a level-1 code looks like 33000000, two more digits are filled in for level 2 (e.g. 33230000), and so on. The related codes can appear anywhere in the dataframe, NOT necessarily in consecutive rows, but they always follow this logic.
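To make the pattern concrete, here is a minimal sketch of how the codes of levels 1 to 4 can be derived from a single 8-digit code. The helper name `ancestor_codes` is made up for this illustration.

```python
def ancestor_codes(code):
    """Hypothetical helper: codes of levels 1 to 4 for one 8-digit code."""
    # Clearing the last 6, 4, 2 and 0 digits gives levels 1, 2, 3 and 4.
    return [code - code % 10**x for x in (6, 4, 2, 0)]

print(ancestor_codes(33235678))
# [33000000, 33230000, 33235600, 33235678]
```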

I have written the following, which works as long as only one row has level == 4 (to check, replace the second level 4 with e.g. 9). In reality there are many such rows, and then I get the error shown after the code:

def combined_names(code):
    code_list=[(code-code%10**x) for x in [6,4,2,0]]

    # Codes for levels 1 to 4 of a level-4 code, obtained by subtracting
    # the remainder modulo 10**6, 10**4, 10**2 and 1.
    # E.g. 33235678 gives 33000000, 33230000, 33235600 and 33235678.

    print(code_list)
    name1=df11.query('code == @code_list[0]')['Name'].tolist()
    name2=df11.query('code == @code_list[1]')['Name'].tolist()
    name3=df11.query('code == @code_list[2]')['Name'].tolist()
    name4=df11.query('code == @code_list[3]')['Name'].tolist()

    name_list=name1+name2+name3+name4
    print(name_list)

    all_names = '+'.join(name_list)
    return all_names
combined_names(33235678)

conditions = [df11['level'] == 4]
choices = [combined_names( df11.query('level==4')['code'].item() )] 
# Problem: with more than one level-4 row, .item() fails instead of iterating over them.

# CHECK : https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas

df11['all_names'] = np.select(conditions, choices, default='NaN')
print(df11) 

ValueError: can only convert an array of size 1 to a Python scalar

How can I modify the function to catch ALL rows in df that meet the condition? And in general, what is a more efficient way to do this task? Thank you!

physiker

1 Answer


Create a temporary column that holds Name only where level is 4:

df11['level_4'] = df11.loc[df11['level']==4,'Name']

Backward fill so that each preceding row is associated with the next level-4 row:

df11 = df11.bfill()

Group by level_4 and aggregate with string concatenation:

M = df11.groupby('level_4').Name.agg(lambda x: x.str.cat(sep='+'))
M = M.rename('combined_names')

Merge back to the original dataframe:

(df11[['code', 'Name']]
 .merge(M, left_on='Name', right_on='level_4')
 .drop('Name', axis=1)
)

    code    combined_names
0   33235678    txt1+txt2+txt3+txt4
1   17980000    txt5+txt6
sammywemmy
  • Thanks, great, I didn't know about bfill(). But the rows to fill in from are not always the previous ones. The logic is to search for the pattern in code_list that I have written. Can I include that? – physiker Mar 15 '20 at 10:17
  • If you could, edit your original question and include an explanation of your code, so others may contribute as well. – sammywemmy Mar 15 '20 at 10:28
  • I added more info. – physiker Mar 15 '20 at 10:56
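
As the comments point out, a backward fill only works when the related rows sit directly before each level-4 row. A minimal sketch of a lookup by code instead, which does not depend on row order, could look like the following. It assumes the codes are unique and reuses the modulo logic from the question; the names `name_by_code` and `combine_names_by_code` are made up for this example. It also sidesteps the ValueError from the question, which comes from calling `.item()` on a selection with more than one row.

```python
import pandas as pd

df11 = pd.DataFrame({
    'code': [33000000, 33230000, 33235600, 33235678, 17000000, 17980000],
    'Name': ['txt1', 'txt2', 'txt3', 'txt4', 'txt5', 'txt6'],
    'level': [1, 2, 3, 4, 3, 4],
})

# Map every code to its name once, so each lookup is a dict access
# instead of a DataFrame query (assumes codes are unique).
name_by_code = dict(zip(df11['code'], df11['Name']))

def combine_names_by_code(code, sep='+'):
    """Hypothetical helper: join the names of a code's ancestors and itself."""
    # Zero out the last 6, 4, 2 and 0 digits to get the level 1..4 codes.
    candidates = [code - code % 10**x for x in (6, 4, 2, 0)]
    # dict.fromkeys drops repeated codes while keeping their order.
    unique_codes = dict.fromkeys(candidates)
    # Skip ancestor codes that are missing from the table.
    return sep.join(name_by_code[c] for c in unique_codes if c in name_by_code)

# Apply the helper to every level-4 row, wherever it sits in the dataframe.
level4 = df11.loc[df11['level'] == 4, ['code']].copy()
level4['combined_names'] = level4['code'].map(combine_names_by_code)
result = level4.reset_index(drop=True)
print(result)
```

On the sample data this gives txt1+txt2+txt3+txt4 for 33235678 and txt5+txt6 for 17980000. Dropping repeated codes matters for 17980000, whose level-2, level-3 and level-4 codes coincide under the modulo rule. Building the dictionary once keeps the work per level-4 row small, which should help with around 100,000 rows.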