I'm trying to extract some lines from an HTML source file. The sample below is simplified, but the idea is the same. I want the output in numerical order, that is: Form 1, Form 2, Form 3, Form 4. The problem is that the inner loop restarts from the beginning of the string on every pass of the outer loop, so I get: Form 1, Form 2, Form 3, Form 2. How can I change the code so that the inner loop continues on and extracts the Form 4 text?

Code

import re

line = ('bla bla bla<form>Form 1</form> some text...<form1>Form 2</form1> more text?'
        'bla bla bla<form>Form 3</form> some text...<form1>Form 4</form1> more text?')

for match in re.finditer(r'<form>(.*?)</form>', line, re.S):
    print(match.group(1))
    for match1 in re.finditer(r'<form1>(.*?)</form1>', line, re.S):
        print(match1.group(1))
        break
Jane

3 Answers

1

Is this what you want?

>>> for item in re.finditer(r'<form[12]?>([^<]+)',line):
...     item.groups()[0]
...     
'Form 1'
'Form 2'
'Form 3'
'Form 4'

If it is, just don't tell anyone that it was my idea to use regex for HTML.

Bill Bell
0
for match in re.finditer('<form1?>(.*?)</form1?>', line, re.S):
    print(match.group(1))

I modified the code:

for match in re.finditer(r'(<form>(.*?)</form>)|(<form1>(.*?)</form1>)', line, re.S):
    # Group 4 is set when the <form1> alternative matched; otherwise use group 2.
    if match.group(4) is not None:
        print(match.group(4))
    else:
        print(match.group(2))
William Feirie
  • That's what I am using. But notice that there are two patterns, and they alternate. So the problem is how to loop over them so that it searches in the following order: <form>, <form1>, <form>, <form1>.
    – Jane Mar 01 '18 at 03:40
  • Yes, the first code has that problem. I modified the code. – William Feirie Mar 01 '18 at 04:15
0

The returned match object has a start method, which takes the index of the desired group and returns the position in the searched string (i.e. line) where that group begins; end likewise returns where it stops. You can then make the inner loop start at that position rather than at the beginning of line by slicing line (e.g. line[some_index:]). Note that passing match.group(1) to the inner re.finditer instead of line would not help here: group 1 contains only the text between <form> and </form>, so it never includes a <form1> element.
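As a rough sketch of the idea of resuming the inner search from the outer match's position: the version below assumes the sample line from the question, and uses compiled patterns so that the pos argument of Pattern.search can stand in for slicing. The names outer, inner and results are only illustrative.

```python
import re

line = ('bla bla bla<form>Form 1</form> some text...<form1>Form 2</form1> more text?'
        'bla bla bla<form>Form 3</form> some text...<form1>Form 4</form1> more text?')

# Illustrative names; compiled patterns accept a start position in search().
outer = re.compile(r'<form>(.*?)</form>', re.S)
inner = re.compile(r'<form1>(.*?)</form1>', re.S)

results = []
for match in outer.finditer(line):
    results.append(match.group(1))
    # Resume the inner search where the outer match ended,
    # instead of at the beginning of line.
    match1 = inner.search(line, match.end())
    if match1 is not None:
        results.append(match1.group(1))

print(results)
```

Using search rather than an inner finditer loop with break takes only the first <form1> element that follows each <form> element, which is what the original nested loop appears to be aiming for.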

However, it is generally not a good idea to parse HTML by hand unless the targeted markup is simple enough. Consider using a library that is easy to use yet robust for parsing and extracting data from HTML.
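For illustration only, here is one way this could look with the html.parser module from the Python 3 standard library; a third-party parser would work along similar lines. The class name FormTextCollector and its attributes are made up for this sketch, and it assumes the elements of interest are named form and form1 and are not nested.

```python
from html.parser import HTMLParser


class FormTextCollector(HTMLParser):
    """Hypothetical helper: collects text inside <form>/<form1> in document order."""

    TAGS = {'form', 'form1'}

    def __init__(self):
        super().__init__()
        self.texts = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self._inside = True
            self.texts.append('')  # start a new entry for this element

    def handle_endtag(self, tag):
        if tag in self.TAGS:
            self._inside = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self._inside:
            self.texts[-1] += data


line = ('bla bla bla<form>Form 1</form> some text...<form1>Form 2</form1> more text?'
        'bla bla bla<form>Form 3</form> some text...<form1>Form 4</form1> more text?')

parser = FormTextCollector()
parser.feed(line)
parser.close()
print(parser.texts)
```

Because the parser walks the document once from start to end, the texts come out in the order the elements appear, with no need to coordinate two separate searches.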

Dummmy