I am trying to split a column based on type: I want the numbers shown separately from the text.

I first tried to assign the result without a loop, but the shapes did not match, so I resorted to looping over the rows. The loop, however, puts only the last number into every row.

Python input:

import re
import pandas as pd

newdf = pd.DataFrame()
newdf['name'] = ('leon', 'eurika', 'monica', 'wian')
newdf['surname'] = ('swart38', '39swart', '11swart', 'swart10')
a = newdf.shape[0]

newdf['age'] = ""
for i in range(0, a):
    newdf['age'] = re.sub(r'\D', "", str(newdf.iloc[i, 1]))

print(newdf)

I expect the age column to show 38, 39, 11, 10. Instead, every row shows "10", the value taken from the last row.

Out:

     name  surname age
0    leon  swart38  10
1  eurika  39swart  10
2  monica  11swart  10
3    wian  swart10  10
DjaouadNM
Leon Swart
  • your code would work (although it would not be very performant because of the for loop) had you replaced `newdf['age']` with `newdf.loc[i, 'age']` – godfryd Sep 05 '19 at 08:36

2 Answers

This happens because newdf['age'] = ... assigns to the entire column on every iteration of the for loop, so each pass overwrites all rows. The last value assigned was 10, which is why every row ends up as 10.
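The difference between assigning to a whole column and assigning to one cell can be seen in a minimal sketch. The two-row frame below is made up for illustration and is not the asker's data:

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the behaviour.
df = pd.DataFrame({"surname": ["swart38", "39swart"]})

# Assigning a scalar to a column label sets every row to that value.
df["age"] = "38"
df["age"] = "39"  # overwrites the whole column again
print(df["age"].tolist())  # ['39', '39']

# Assigning through .loc with a row label changes a single cell.
df.loc[0, "age"] = "38"
print(df["age"].tolist())  # ['38', '39']
```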

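You can fix it by assigning to one row at a time. Use .loc for this rather than chained indexing such as newdf['age'][i], which pandas may apply to a temporary copy:

a = newdf.shape[0]
newdf['age'] = ""
for i in range(0, a):
    newdf.loc[i, 'age'] = re.sub(r'\D', "", str(newdf.iloc[i, 1]))
    #    ^^^^^^^^^^^^^^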

Or, instead of looping, use pandas.Series.str.extract with a raw-string pattern:

newdf['age'] = newdf['surname'].str.extract(r'(\d+)', expand=False)
print(newdf)

Output:

     name  surname age
0    leon  swart38  38
1  eurika  39swart  39
2  monica  11swart  11
3    wian  swart10  10
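Since the question asks for numbers separate from text, the same idea can be extended to produce both parts. This is only a sketch: the column name surname_text is invented here, and converting the ages with pd.to_numeric assumes they are wanted as numbers rather than strings.

```python
import pandas as pd

newdf = pd.DataFrame({
    "name": ["leon", "eurika", "monica", "wian"],
    "surname": ["swart38", "39swart", "11swart", "swart10"],
})

# First run of digits in each surname, returned as a Series of strings.
digits = newdf["surname"].str.extract(r"(\d+)", expand=False)

# Assumption: the ages should be numeric, so convert the digit strings.
newdf["age"] = pd.to_numeric(digits)

# Hypothetical extra column holding the surname with all digits removed.
newdf["surname_text"] = newdf["surname"].str.replace(r"\d+", "", regex=True)

print(newdf)
```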
Chris
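Try using Series.str.replace. Pass regex=True, because since pandas 2.0 the pattern is treated as a literal string by default:

newdf['age'] = newdf['surname'].str.replace(r'\D+', '', regex=True)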
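The extract and replace approaches agree on the data in the question but can differ on other inputs. The sketch below uses two made-up values, "swart" and "a1b22", that do not appear in the original data, to show where they might diverge:

```python
import pandas as pd

# "swart" and "a1b22" are hypothetical inputs added for illustration.
s = pd.Series(["swart38", "39swart", "swart", "a1b22"])

# extract keeps only the first run of digits and gives a missing value
# when the string contains no digits at all.
first_run = s.str.extract(r"(\d+)", expand=False)

# replace removes every non-digit, so separate digit runs are joined and
# a string without digits becomes an empty string.
all_digits = s.str.replace(r"\D+", "", regex=True)

print(first_run.isna().tolist())  # [False, False, True, False]
print(all_digits.tolist())        # ['38', '39', '', '122']
```

Which behaviour is preferable depends on whether a missing value or an empty string is easier to handle downstream.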
Tim Biegeleisen