I want to extract numbers using a regular expression.
df['price'][0]
contains
'[<em class="letter" id="infoJiga">3,402,000</em>]'
and I want to extract 3402000 from it.
How can I do this in a pandas DataFrame?
Since the value is a string, try the code below.
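# df['price'][0] returns '[<em class="letter" id="infoJiga">3,402,000</em>]'; call this x.
x = df['price'][0]
# Take the text after the first '>' and keep only its digit characters.
y = ''.join(c for c in x.split('>')[1] if c.isdigit())
print(y)
# output: 3402000 (a string; use int(y) if you need an integer)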
Hope it works.
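To run the same split-and-filter idea over the whole column rather than a single cell, one option is a small helper passed to apply. This is only a sketch: the helper name digits_after_tag and the price_num column are made up for illustration, and it assumes every cell contains a '>' followed by the number.

```python
import pandas as pd

df = pd.DataFrame({"price": [
    '[<em class="letter" id="infoJiga">3,402,000</em>]',
    '[<em class="letter" id="infoJiga">2,000</em>]',
]})

def digits_after_tag(x):  # hypothetical helper name
    # The text after the first '>' looks like '3,402,000</em';
    # keep only its digit characters and convert to int.
    return int(''.join(c for c in x.split('>')[1] if c.isdigit()))

# price_num is an illustrative column name, not from the question.
df["price_num"] = df["price"].apply(digits_after_tag)
print(df["price_num"].tolist())
```

A cell without any '>' would raise an IndexError here, so this only fits data that is formatted uniformly.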
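The simplest regex, assuming nothing about the environment, may be ([\d,]+) (with * instead of +, the pattern can match an empty string at the start of the text). Then remove the commas from the match and use pandas' to_numeric function to convert it, since to_numeric cannot parse a string such as '3,402,000' directly.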
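As a rough sketch of that regex-plus-to_numeric route applied to a whole column, the pandas string methods can do the extraction without a Python-level loop. The sample data and the price_num column name are assumptions for illustration; it also assumes the number is the first run of digits and commas in each cell.

```python
import pandas as pd

df = pd.DataFrame({"price": [
    '[<em class="letter" id="infoJiga">3,402,000</em>]',
    '[<em class="letter" id="infoJiga">2,000</em>]',
]})

# Capture the first run of digits and commas, e.g. '3,402,000'.
extracted = df["price"].str.extract(r"([\d,]+)", expand=False)

# Drop the thousands separators, then convert the strings to numbers.
df["price_num"] = pd.to_numeric(extracted.str.replace(",", "", regex=False))
print(df["price_num"].tolist())
```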
Are all your values formatted the same way? If so, you can use a simple regular expression to extract the numeric values and then convert them to int.
import pandas as pd
import re
test_data = ['[<em class="letter" id="infoJiga">3,402,000</em>]','[<em class="letter" id="infoJiga">3,401,000</em>]','[<em class="letter" id="infoJiga">3,400,000</em>]','[<em class="letter" id="infoJiga">2,000</em>]']
df = pd.DataFrame(test_data)
>>> df[0]
0 [<em class="letter" id="infoJiga">3,402,000</em>]
1 [<em class="letter" id="infoJiga">3,401,000</em>]
2 [<em class="letter" id="infoJiga">3,400,000</em>]
3 [<em class="letter" id="infoJiga">2,000</em>]
Name: 0, dtype: object
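Define a function that extracts the number and returns it as an integer
def get_numeric(data):
    # Capture everything between the first '>' and the last '<'
    match = re.search('>(.+)<', data)
    if match:
        # Strip the thousands separators before converting
        return int(match.group(1).replace(',',''))
    return None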
Apply it to the DataFrame
df[1] = df[0].apply(get_numeric)
>>> df[1]
0 3402000
1 3401000
2 3400000
3 2000
Name: 1, dtype: int64
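One thing to watch with this approach: get_numeric returns None when nothing matches, and a column that holds a missing value can no longer be plain int64. A possible way to keep integer values is to cast to pandas' nullable Int64 dtype. The sketch below uses a made-up non-matching row and tightens the pattern to digits and commas; both are assumptions, not part of the original answer.

```python
import re
import pandas as pd

def get_numeric(data):
    # Variant of the function above, restricted to digits and commas.
    match = re.search(r'>([\d,]+)<', data)
    if match:
        return int(match.group(1).replace(',', ''))
    return None

s = pd.Series([
    '[<em class="letter" id="infoJiga">3,402,000</em>]',
    '[no price listed]',  # hypothetical row with no number
])

# Cast to the nullable integer dtype so the missing row becomes <NA>.
result = s.apply(get_numeric).astype('Int64')
print(result)
```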