
I want to extract a number from a string using a regular expression.

df['price'][0]

contains

'[<em class="letter" id="infoJiga">3,402,000</em>]'

and I want to extract 3402000 from it.

How can I do this in a pandas DataFrame?

ᴀʀᴍᴀɴ
kavin
  • Looks like parsing HTML with regex to me - naughty developer... https://stackoverflow.com/a/1732454/2071828 – Boris the Spider Aug 18 '18 at 11:37
  • Welcome to Stack Overflow. Show your data, your desired output, and what you have tried. You also might want to read [MCVE](https://stackoverflow.com/help/mcve) – Quickbeam2k1 Aug 18 '18 at 11:37

3 Answers


Since the value is a string, you can handle it with plain string methods. Try the code below.

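# df['price'][0] returns this string; call it x
x = '[<em class="letter" id="infoJiga">3,402,000</em>]'

# keep only the digits from the text after the first '>'
y = ''.join(c for c in x.split('>')[1] if c.isdigit())
print(y)

Output: 3402000

Note that y is still a string; use int(y) if you need a number.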

Hope it works.

Raju
  • If this code works for you, please accept the answer and upvote it so others can use it. – Raju Aug 20 '18 at 05:28

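The simplest regex, assuming nothing about the surrounding markup, may be ([\d,]+). Use + rather than *, because ([\d,]*) can match an empty string at the very start of the text. Then you can strip the commas and use pandas' to_numeric function.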
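A minimal sketch of this approach, assuming the column is named price as in the question and that every cell contains exactly one run of digits and commas. The pattern ([\d,]+) is used here rather than ([\d,]*) so that it cannot match an empty string. The output column name price_num is made up for illustration.

```python
import pandas as pd

# Sample strings in the format shown in the question.
df = pd.DataFrame({'price': [
    '[<em class="letter" id="infoJiga">3,402,000</em>]',
    '[<em class="letter" id="infoJiga">2,000</em>]',
]})

# Capture the first run of digits and commas in each cell.
raw = df['price'].str.extract(r'([\d,]+)', expand=False)

# Drop the thousands separators, then convert to a numeric dtype.
# 'price_num' is a hypothetical column name, not from the question.
df['price_num'] = pd.to_numeric(raw.str.replace(',', '', regex=False))

print(df['price_num'].tolist())
```

Because str.extract and to_numeric work on the whole column at once, no Python-level loop or apply call is needed. Cells with no match become NaN after extraction, so they would need separate handling.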

Jónás Balázs

Are all your values formatted the same way? If so, you can use a simple regular expression to extract the numeric values then convert them to int.

import pandas as pd
import re

test_data = [
    '[<em class="letter" id="infoJiga">3,402,000</em>]',
    '[<em class="letter" id="infoJiga">3,401,000</em>]',
    '[<em class="letter" id="infoJiga">3,400,000</em>]',
    '[<em class="letter" id="infoJiga">2,000</em>]',
]
df = pd.DataFrame(test_data)
>>> df[0]
0    [<em class="letter" id="infoJiga">3,402,000</em>]
1    [<em class="letter" id="infoJiga">3,401,000</em>]
2    [<em class="letter" id="infoJiga">3,400,000</em>]
3        [<em class="letter" id="infoJiga">2,000</em>]
Name: 0, dtype: object

Define a function that extracts the number and returns it as an integer:

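def get_numeric(data):
    # Capture everything between the first '>' and the last '<'
    match = re.search(r'>(.+)<', data)
    if match:
        # Drop the thousands separators before converting
        return int(match.group(1).replace(',', ''))
    return None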

Apply it to the DataFrame column:

df[1] = df[0].apply(get_numeric)
>>> df[1]
0    3402000
1    3401000
2    3400000
3       2000
Name: 1, dtype: int64
Xnkr