
I have a large dataframe (df) and in the last column, all of the elements are showing up as

1055.0000.0

so the last 2 characters are always ".0". What's the most efficient way to remove them? The last column's name is always different, so I'm not sure how to approach this. I have tried looping over the pandas df, but it takes too much memory and breaks the code. Is there a way to do something like

df[ last column ] = df[ last column - last 2 characters]

or make a new df then append it in?
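A minimal reproduction of the setup (the column name and values here are invented), showing that the last column can always be addressed via `df.columns[-1]`, so its name never needs to be hard-coded:

```python
import pandas as pd

# Hypothetical dataframe reproducing the issue: the last column's
# values always end in ".0".
df = pd.DataFrame({"id": [1, 2], "some_col": ["1055.0000.0", "23.5000.0"]})

last = df.columns[-1]          # name of the last column, whatever it is
df[last] = df[last].str[:-2]   # vectorized slice: drop the trailing ".0"
print(df[last].tolist())       # → ['1055.0000', '23.5000']
```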

3 Answers


Vectorized operations are almost always faster. The .str accessor lets pandas vectorize string operations:

df["last_col"].str[:-2]

You can time it with the %%timeit magic command in a Jupyter notebook.

%%timeit
df.iloc[:, -1].str[:-2]
>>> 352 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df["last_col"].str[:-2]
>>> 242 µs ± 4.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
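Outside a notebook, the standard-library timeit module gives the same kind of comparison (data and timings here are made up, so the numbers will differ from the answer's):

```python
import timeit
import pandas as pd

# Hypothetical column of strings to slice.
df = pd.DataFrame({"last_col": ["1055.0000.0"] * 10_000})

t_iloc = timeit.timeit(lambda: df.iloc[:, -1].str[:-2], number=100)
t_name = timeit.timeit(lambda: df["last_col"].str[:-2], number=100)
print(f"iloc: {t_iloc:.4f}s  name: {t_name:.4f}s")
```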
haneulkim

Try with the str accessor:

df.iloc[:, -1] = df.iloc[:, -1].astype(str).str[:-2].astype(float)
U13-Forward
  • This is very efficient, thank you. How would you then go about making all the elements of the column into integers? – JohnCena1997 Oct 07 '21 at 09:55
  • Very good, thank you. One last thing: I am getting an error saying "Can only use .str accessor with string values, which use np.object_ dtype in pandas". I am assuming this means the input, e.g. 1055.0000.0, is not recognised as a string? How would I change the whole column beforehand? – JohnCena1997 Oct 07 '21 at 10:03
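Both follow-ups come down to casting: call .astype(str) before .str to avoid the accessor error, and go through float on the way to int, since the stripped values (e.g. "1055.0000") still contain a decimal point and a direct astype(int) would raise. A sketch with made-up data:

```python
import pandas as pd

# Hypothetical column; .astype(str) guards against non-string dtypes that
# would otherwise raise "Can only use .str accessor with string values".
df = pd.DataFrame({"col": ["1055.0000.0", "23.5000.0"]})

stripped = df.iloc[:, -1].astype(str).str[:-2]  # "1055.0000", "23.5000"
as_int = stripped.astype(float).astype(int)     # "1055.0000" -> 1055.0 -> 1055
print(as_int.tolist())                          # → [1055, 23]
```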

You could also use rsplit:

s = '105.0000.0'
s.rsplit('.0', 1)[0]

output:

105.0000
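Applied to a whole column, the same idea works through the .str accessor (column name made up here); with n=1, rsplit splits only on the rightmost ".0", so interior ".0" substrings are left alone:

```python
import pandas as pd

# Hypothetical column ending in ".0".
df = pd.DataFrame({"col": ["105.0000.0", "1055.0000.0"]})

# rsplit from the right, at most once; keep the part before the final ".0".
df["col"] = df["col"].str.rsplit(".0", n=1).str[0]
print(df["col"].tolist())  # → ['105.0000', '1055.0000']
```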
BlackMath