0

I am trying to extract number from thounsands of rows and have been using below code, I have been able to extract the numbers from the string but unable to replace it in the column.

df_consol.loc[9122:,'FDD - 1800.1'].head(10)

9122 15 MHz 9123 15 MHz 9124 15 MHz 9125 15 MHz 9126 15 MHz 9127 15 MHz 9128 15 MHz 9129 15 MHz 9130 15 MHz 9131 15 MHz

New_Col=df_consol['FDD - 1800.1'].str.extract(r'^(\d{2})',expand=False)

OUTPUT 9122 15 9123 15 9124 15 9125 15 9126 15 9127 15 9128 15 9129 15 9130 15 9131 15

df_consol['FDD - 1800.1']=df_consol['FDD - 1800.1'].str.extract(r'^(\d{2})',expand=False)

above code is unable to replace the existing columns of the dataframe

1 Answers1

0

There exists the dedicated method .str.replace() for this:

import pandas as pd
# create toy data
df = pd.DataFrame(['9122 15 MHz','9123 15 MHz','9124 15 MHz', '9125 15 MHz','9126 15 MHz','9127 15 MHz','9128 15 MHz','9129 15 MHz','9130 15 MHz','9131 15 MHz'],columns=['col1'])

# replace strings in data
df['col1'].str.replace(' MHz','unit')

output In this snippet, I replaced the string ' MHz' with the term 'unit'. You can also replace them with an empty string, if you want to get rid of the unit.

Perhaps, you want to extract the unit an convert the column to a numeric value. Do this with (if you have not put your line-number within the toy-example data as I did^^):

df = pd.DataFrame(['15 MHz','15 MHz','15 MHz', '15 MHz','15 MHz','15 MHz','15 MHz','15 MHz','15 MHz','15 MHz'],columns=['col1'])

df['col1'].str.replace(' MHz','').astype('int64')

converted output

EDIT:

If you have strings of variable length with a variable number of numbers that you want to extract, you can either use re (regular expression aka. regex) or the build-in string-to-integer function int. For both, see the answers to this question. I have created an example with both:

import pandas as pd
# create toy data
df = pd.DataFrame(['9122 15 MHz 9123 15 MHz 9124 15 MHz', '9125 15 MHz 9126 15 MHz 9127 15 MHz 9128 15 MHz9129 15 MHz 9130 15 MHz','9131 15 MHz'],columns=['col1'])

# nativ string-to-number conversion:
lst = []
for i,row in df.iterrows():
    # get each (positive) number in string
    lst.append( [int(s) for s in row['col1'].split() if s.isdigit()] )
# add new list to DataFrame
df['num'] = lst

# regular expression conversion:
import re
lst = []
for i,row in df.iterrows():
    lst.append( re.findall(r'\d+', row['col1']) )
# add new list to DataFrame
df['re'] = lst

string2num

max
  • 3,915
  • 2
  • 9
  • 25
  • Hi Max ,thanks for replying,I need to extract the number from the entire string so tge string could by '15Mhz',it could be '15M',Mhz15,M15,GHz 15, and post the number is extracted I need to replace in the original location – Sushanto Banerjee Dec 04 '20 at 13:11
  • OK, I am not sure if I got this right. You want to extract all numbers in a string and do what with them? Why is such a string in a pandas DataFrame? – max Dec 05 '20 at 07:09
  • Ya exactly basically extract any number in the df column ,then replace the column with those numbers extracted,basically I facing challenge that a particular column is not stream lined ,it needs to be in integer format but human error resulted in text string as well,so just want to extract the numbers from the given column and replace tha same no in its place (where the original string is from where the number is extracted from) – Sushanto Banerjee Dec 06 '20 at 11:43
  • @SushantoBanerjee OK, but then you are basically asking [How to extract numbers from a string in Python?](https://stackoverflow.com/questions/4289331/how-to-extract-numbers-from-a-string-in-python#4289557) I added an example for you to this question – max Dec 07 '20 at 07:01
  • HI max,So I extracted the numbers using regex as well,but i was not able to replace them in the original columns,if u see its in my exaple as well. – Sushanto Banerjee Dec 09 '20 at 11:52
  • No, of course not. You cant (re)place a column of strings with a list of numerics. If you want to replace it, you need to replace the entire column, because the element type is defined by the column (its not a generic container) – max Dec 09 '20 at 14:07