How to extract exact position using regular expression in python?

Question

Possible Duplicate:
Python Regex Use - How to Get Positions of Matches

I am new to Python. I have written a program that uses a regular expression to extract a number from a web page, based on command-line arguments. The first argument should be 'Amount', the second 'From', and the third 'To'. I need to extract the converted amount from the page "http://www.xe.com/ucc/convert/?Amount=1&From=INR&To=USD". The code is:

import requests
import re
import sys

amount=sys.argv[1]
from_=sys.argv[2]
to=sys.argv[3]
r = requests.get("http://www.xe.com/ucc/convert/?Amount=%(amount)s&From=%(from_)s&To=%(to)s"%{"amount":amount,"from_":from_,"to":to})
dataCrop=re.findall('[0-9,]+\.[0-9]+',r.text)
if amount<'1':
    print dataCrop[15]
else:
    print dataCrop[11]

But the problem is that I should not rely on exact positions in the result list, that is:

if amount<'1':
    print dataCrop[15]
else:
    print dataCrop[11]

Instead, I should modify my regular expression. How can I write a regular expression for this? I can't use Beautiful Soup.

"Automated extraction of rates is prohibited under the Terms of Use." — jfs, Aug 29 '12 at 05:53

score 1 · Answer 1 · answered Aug 29 '12 at 05:37

The re.search function returns a match object. You can use its span() method to find the position of the match. Hope this helps :-)

Raymond Hettinger

I used re.search as `dataCrop=re.search(r'[0-9,]+\.[0-9]+',r.text,flags=0)` and got the output `<_sre.SRE_Match object at 0xd26ac0>`. Then, when I used `print dataCrop.group(0)`, I got 1.0 as output. I didn't get the exact answer. – user1632091 Aug 29 '12 at 06:03
@user1632091 Raymond's answer is suggesting you look at the result of `dataCrop.span(0)` rather than `dataCrop.group(0)`. – lvc Aug 29 '12 at 08:09
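
To make the difference between group() and span() concrete, here is a minimal sketch. The sample string is made up for illustration and is not the actual page content; the pattern is the one from the question.

```python
import re

# Hypothetical sample text; the real page content will differ.
text = "1.00 INR = 0.0180 USD"

# Same pattern as in the question, written as a raw string.
pattern = re.compile(r'[0-9,]+\.[0-9]+')

match = pattern.search(text)   # first match only (None if nothing matches)
first_value = match.group(0)   # the matched text
first_span = match.span(0)     # (start, end) offsets into text

# finditer yields one match object per non-overlapping match.
all_spans = [(m.group(0), m.span()) for m in pattern.finditer(text)]

print(first_value, first_span)
print(all_spans)
```

Note that span() only reports where a match sits in the string. It does not say which of several numbers is the converted amount, so it may not be enough on its own.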

score 1 · Answer 2 · answered Aug 29 '12 at 07:47

The position where a regex matches is not very useful information in your case, though, as @Raymond Hettinger suggested, it is easily accessible via the match object.

You could split your task into three steps.

Construct the web page's URL

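import sys
from urllib.parse import urlencode  # Python 2: from urllib import urlencode

if len(sys.argv) != 4:
    sys.exit(2)  # wrong number of command-line arguments

# Pair each query-parameter name with the matching command-line argument.
params = urlencode(list(zip("Amount From To".split(), sys.argv[1:])))
url = "http://example.com/path/?" + params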

urlencode() provides proper encoding of sys.argv.
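
As a quick illustration of what that encoding does, here is a small sketch with made-up argument values standing in for sys.argv[1:]. It assumes Python 3, where the function lives in urllib.parse.

```python
from urllib.parse import urlencode  # Python 3 location of urlencode

# Hypothetical argument values containing characters that need escaping.
args = ["1,000.50", "INR", "U S&D"]

# Pair the parameter names with the values and encode them.
params = urlencode(list(zip(["Amount", "From", "To"], args)))
print(params)  # Amount=1%2C000.50&From=INR&To=U+S%26D
```

The comma, space, and ampersand are escaped, so a stray character in an argument cannot change the structure of the query string.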

Retrieve the web page

from selenium.webdriver import Firefox as Browser # pip install selenium

browser = Browser()
try:
    browser.implicitly_wait(3) # seconds
    browser.get(url)
    page = browser.page_source
finally:
    browser.quit() # quit no matter what

selenium.webdriver takes care of pages generated using JavaScript.

Find relevant data in it

import re

# Escape the dot so it matches a literal decimal point, not any character.
print(re.findall(r'(\d+\.\d+).*?"uniq_class_near_data"', page))

It will break if the page markup changes.
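
The idea of anchoring the regex on nearby markup, instead of indexing into a list of all numbers, can be sketched on a made-up fragment. The HTML below and its layout are assumptions for illustration only, and "uniq_class_near_data" is just the placeholder name used above; the real page has to be inspected to find a suitable marker.

```python
import re

# Made-up fragment; the real page markup is unknown and will differ.
page = (
    '<td>1.00 <span class="uniq_class_near_data">INR</span></td>'
    '<td>=</td>'
    '<td>0.0180 <span class="uniq_class_near_data">USD</span></td>'
)

# Capture a number only when it is directly followed by the marker span,
# so the result does not depend on how many other numbers the page contains.
pattern = re.compile(
    r'([0-9,]+\.[0-9]+)\s*<span class="uniq_class_near_data">([A-Z]{3})</span>'
)

rates = pattern.findall(page)  # list of (amount, currency) tuples
print(rates)
```

With a pattern like this there is no need for dataCrop[15] or dataCrop[11]: the converted amount is whichever tuple carries the target currency code.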

Here's a BeautifulSoup variant for comparison:

from bs4 import BeautifulSoup # pip install beautifulsoup4

soup = BeautifulSoup(page, "html.parser") # name the parser explicitly
print([span.find_previous_sibling(text=re.compile(r'\d+\.\d+')).strip()
       for span in soup('span', class_="uniq_class_near_data", limit=2)])
