0

I have a dataframe that looks like:

 data1 = [{'price2022': "12014\n205****", 'company': "toyota",'price2023': "10014\n180****"}, 
    {'price2022': "22018", 'company': "apple",'price2023': "22018"}, 
    {'price2022': "32020", 'company': "general electric",'price2023': "31020"}, 
    {'price2022': "80170", 'company': "alibaba",'price2023': "83170"}
   ]
 df1 = pd.DataFrame(data1)

The first value is a multiline string, which also contains the redundant string ''. Instead of the multiline string "12014\n205", I would like to have a single line number that is the sum of the two lines. (12014+205=12219).

I could try something like this:

dfa[['b', 'c']] = df1["price2022"].apply(lambda x: pd.Series(str(x).split("\n")))
dfa['c'] = dfa['c'].map(lambda x: str(x)[:-4])   #gets rid of the ****, probably not the smartest method
dfa['b']=  dfa['b'].astype('int')
dfa['c'].replace('', 0, inplace=True)
dfa['c']=  dfa['c'].astype('int')
dfa['d']=dfa['b']+dfa['c']

However, this seems incredibly inefficient. Not to mention that I have several 'price' columns I need to run through. Creating new variables for each seems like a bad way to deal with this. Is there a more efficient way to do this without creating multiple new columns? How would I extend this such that I don't have to go have a look which columns have these multi lines and which don't, but the code just runs through all?

Peter
  • 441
  • 1
  • 6
  • 21

2 Answers2

1

You can loop over you columns and apply a function that will split/sum the prices:

import pandas as pd

data1 = [{'price2022': "12014\n205****", 'company': "toyota",'price2023': "10014\n180****"}, 
    {'price2022': "22018", 'company': "apple",'price2023': "22018"}, 
    {'price2022': "32020", 'company': "general electric",'price2023': "31020"}, 
    {'price2022': "80170", 'company': "alibaba",'price2023': "83170"}
   ]
df1 = pd.DataFrame(data1)

for col in df1.columns:
    if col.startswith('price'):
        df1[f'{col}_sum'] = df1[col].apply(lambda x: sum(map(int, x.strip('****').split('\n'))))

print(df1)

Output:

        price2022           company       price2023  price2022_sum  price2023_sum
0  12014\n205****            toyota  10014\n180****          12219          10194
1           22018             apple           22018          22018          22018
2           32020  general electric           31020          32020          31020
3           80170           alibaba           83170          80170          83170
Tranbi
  • 11,407
  • 6
  • 16
  • 33
0

You can try by:

df["price2022"] = df["price2022"].str.replace("\n", "+").apply(lambda row: eval(row.replace("****", "")))

It gives:

df["price2022"].head()

enter image description here

  • using `eval` is avoidable here. See [why-is-using-eval-a-bad-practice](https://stackoverflow.com/questions/1832940/why-is-using-eval-a-bad-practice) – Tranbi Jun 12 '23 at 08:22
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jun 12 '23 at 11:10