
Possible Duplicate:
Python Regex Use - How to Get Positions of Matches

I am new to Python. I have written a program that uses a regular expression to extract a number from a web page, based on command-line arguments. The first argument is 'Amount', the second is 'From', and the third is 'To'. I need to extract the converted amount from the page "http://www.xe.com/ucc/convert/?Amount=1&From=INR&To=USD". The code is:

import requests
import re
import sys

amount=sys.argv[1]
from_=sys.argv[2]
to=sys.argv[3]
r = requests.get("http://www.xe.com/ucc/convert/?Amount=%(amount)s&From=%(from_)s&To=%(to)s"%{"amount":amount,"from_":from_,"to":to})
dataCrop=re.findall('[0-9,]+\.[0-9]+',r.text)
if amount<'1':
    print dataCrop[15]
else:
    print dataCrop[11]

But the problem is that I should not rely on hard-coded positions in the result list, that is:

if amount<'1':
    print dataCrop[15]
else:
    print dataCrop[11]

Instead, I should modify my regular expression. How can I write a regular expression for this? I can't use Beautiful Soup.


2 Answers


The re.search method returns a MatchObject. You can use its span method to find the position of the match. Hope this helps :-)
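As a minimal sketch of the difference between the matched text and its position (written for Python 3; the sample string is a hypothetical stand-in for the downloaded page, not real site output):

```python
import re

# Hypothetical sample text standing in for r.text.
text = "1.00 INR = 0.0180 USD"
number = r"[0-9,]+\.[0-9]+"

match = re.search(number, text)
if match is not None:
    matched_text = match.group(0)  # the substring that matched
    start, end = match.span(0)     # (start, end) offsets of that substring
    print(matched_text, start, end)

# re.finditer yields one match object per occurrence,
# so the position of every number can be collected.
positions = [m.span() for m in re.finditer(number, text)]
print(positions)
```

Here group(0) gives the text of the first match, while span(0) gives where it sits in the string, so text[start:end] is the same substring.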

Raymond Hettinger
  • I used re.search as `dataCrop=re.search(r'[0-9,]+\.[0-9]+',r.text,flags=0)` and got the output `<_sre.SRE_Match object at 0xd26ac0>`. Then, when I used `print dataCrop.group(0)`, I got 1.0 as output. I didn't get the exact answer. – user1632091 Aug 29 '12 at 06:03
  • @user1632091 Raymond's answer is suggesting you look at the result of `dataCrop.span(0)` rather than `dataCrop.group(0)`. – lvc Aug 29 '12 at 08:09

The position where a regex matches is not very useful information in your case, though as @Raymond Hettinger suggested, it is easily accessible via re.MatchObject.

You could split your task into three steps.

Construct the web page's URL

import sys
import urllib

if len(sys.argv) != 4:
    sys.exit(2)

params = urllib.urlencode(zip("Amount From To".split(), sys.argv[1:]))
url = "http://example.com/path/?" + params

urlencode() properly percent-encodes the values taken from sys.argv.
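The snippet above is Python 2. On Python 3 the function lives in urllib.parse; a rough equivalent might look like the sketch below, where build_url is a hypothetical helper name and the base URL is the same placeholder as above:

```python
from urllib.parse import urlencode


def build_url(base, amount, from_currency, to_currency):
    """Return base plus an encoded query string (hypothetical helper)."""
    params = urlencode([
        ("Amount", amount),
        ("From", from_currency),
        ("To", to_currency),
    ])
    return base + "?" + params


url = build_url("http://example.com/path/", "1", "INR", "USD")
print(url)
```

Passing a list of pairs rather than a dict keeps the parameters in a fixed order, and any special characters in the arguments (such as & or spaces) are escaped instead of corrupting the query string.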

Retrieve the web page

from selenium.webdriver import Firefox as Browser # pip install selenium

browser = Browser()
try:
    browser.implicitly_wait(3) # seconds
    browser.get(url)
    page = browser.page_source
finally:
    browser.quit() # quit no matter what

selenium.webdriver takes care of pages generated using JavaScript.

Find relevant data in it

import re

print re.findall(r'(\d+\.\d+).*?"uniq_class_near_data"', page)

It will break if the page markup changes.
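To illustrate anchoring the number on nearby markup instead of on a list index, here is a small Python 3 sketch. The HTML is invented for the example, and "uniq_class_near_data" is only a placeholder for whatever attribute actually sits next to the value on the real page:

```python
import re

# Hypothetical markup; the real page structure will differ.
page = (
    '<td>1.00 <span class="uniq_class_near_data">INR</span></td>'
    '<td>0.0180 <span class="uniq_class_near_data">USD</span></td>'
)

# Capture a decimal number only when it is followed, within the same
# text node ([^<]* stops at the next tag), by the marker span.
pattern = re.compile(r'(\d[\d,]*\.\d+)[^<]*<span class="uniq_class_near_data"')

amounts = pattern.findall(page)
print(amounts)
```

Because the pattern is tied to the marker, unrelated numbers elsewhere in the page are skipped, so there is no need for indexes such as dataCrop[11].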

Here's a BeautifulSoup variant for comparison:

from bs4 import BeautifulSoup # pip install beautifulsoup4

soup = BeautifulSoup(page)
print [span.find_previous_sibling(text=re.compile(r'\d+\.\d+')).strip()
       for span in soup('span', class_="uniq_class_near_data", limit=2)]
jfs