I have searched this forum for a close match to my problem but could not find a suitable solution, so I am posting the question.

I am using the urllib and re modules to extract certain sections of a webpage. I am also interested in the status associated with each of those sections.

For example, looking at the source of the webpage:

MY-TEXT #1410 finished subtask PREPARE-WORKSPACE #340418: https://cloud6.foo.bar.com/b/job/PREPARE-WORKSPACE/340418

'>SUCCESS

I am using re.compile and re.findall to extract the text that comes after the pattern "https://cloud6.foo". This matches all the expected text, which I have confirmed by inspecting the resulting list, but I am losing the status of each task because it is on the line immediately after the "https://" line.

How can I extract the line after the matched string in this scenario?

Here is the code snippet:

from urllib import urlopen
import re

webpage = urlopen(urllink).read()
buildPhases = re.compile(r'\<a href=\W{1}https\W{3}(.*)')
phaseLists = re.findall(buildPhases, webpage)
for item in phaseLists:
    print item
styvane
Ramu
  • If you're parsing HTML, *use an HTML parser!* – jonrsharpe Nov 12 '15 at 14:26
  • To expand on jonrsharpe's comment, try BeautifulSoup. – durrrutti Nov 12 '15 at 14:27
  • As stated in the comments above, use an HTML parser to do this work (otherwise [tony the pony comes for you](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)). If you edit the question to provide the HTML code (or the link) you're dealing with, we can provide an appropriate (BeautifulSoup or lxml) solution. – Giuseppe Ricupero Nov 12 '15 at 14:37

1 Answer

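To extract the line after the matched string, add .*\n to your regex. Because . does not match a newline by default, each .*\n consumes the rest of the current line plus its line break.
For example, if we take: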

MY-TEXT #1410 finished subtask PREPARE-WORKSPACE #340418: https://cloud6.foo.bar.com/b/job/PREPARE-WORKSPACE/340418

'>SUCCESS

and apply the pattern r'https.*\n.*\n.*', the match should be the text above without the leading:

MY-TEXT #1410 finished subtask PREPARE-WORKSPACE #340418:
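
The idea can be tried out with a minimal sketch like the one below. It is written for Python 3 (the question uses Python 2), and the `sample` string is an assumption reconstructed from the excerpt in the question; the real page source may break its lines differently, in which case the number of `.*\n` groups would need adjusting. The second pattern is a variation with capture groups, not part of the original answer.

```python
import re

# Assumed sample, copied from the excerpt in the question:
# URL line, blank line, then the status line.
sample = (
    "MY-TEXT #1410 finished subtask PREPARE-WORKSPACE #340418: "
    "https://cloud6.foo.bar.com/b/job/PREPARE-WORKSPACE/340418\n"
    "\n"
    "'>SUCCESS\n"
)

# The pattern from the answer: '.' does not match a newline by default,
# so each '.*\n' consumes exactly one line.
whole = re.search(r"https.*\n.*\n.*", sample).group(0)

# Variation with capture groups: group 1 is the URL,
# group 2 is the line two lines below it.
pattern = re.compile(r"(https\S*).*\n.*\n(.*)")

for url, status_line in pattern.findall(sample):
    # Strip the leftover quote and bracket in front of the status.
    status = status_line.lstrip("'>").strip()
    print(url, status)
```

If the status sits on the very next line in the real source, a single `.*\n` before the final `.*` should be enough.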

Kenly
  • Thanks to all of you who responded. I could have used an HTML parser, but because those modules are not available in my environment I resorted to regex. – Ramu Nov 15 '15 at 17:08
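
When third-party parsers such as BeautifulSoup or lxml cannot be installed, the standard library still ships an HTML parser (`html.parser` in Python 3, the `HTMLParser` module in Python 2). The sketch below is only an illustration: the thread never shows the real markup, so the `html` string, the assumption that each status is the text of an `<a>` link to the job URL, and the `JobLinkParser` class name are all hypothetical.

```python
from html.parser import HTMLParser


class JobLinkParser(HTMLParser):
    """Collect (href, link text) pairs for links whose href starts with https://."""

    def __init__(self):
        super().__init__()
        self.current_href = None  # href of the <a> tag we are currently inside
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = (dict(attrs).get("href") or "").strip()
            if href.startswith("https://"):
                self.current_href = href

    def handle_data(self, data):
        # The first non-empty text inside a matching link is taken as the status.
        if self.current_href is not None and data.strip():
            self.results.append((self.current_href, data.strip()))
            self.current_href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None


# Hypothetical markup; the actual page source is not shown in the thread.
html = (
    "MY-TEXT #1410 finished subtask PREPARE-WORKSPACE #340418: "
    "<a href='https://cloud6.foo.bar.com/b/job/PREPARE-WORKSPACE/340418'>SUCCESS</a>"
)

parser = JobLinkParser()
parser.feed(html)
for url, status in parser.results:
    print(url, status)
```

Compared with a line-based regex, this approach does not depend on where the line breaks fall in the page source, only on the tag structure.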