I'm using Python 2.7.3, and here is my code to parse a website read into 'file':

list = re.findall(r'<span info=".+</span>| \
          Name: .+<br>| \
          <span id="Phone" info="phonenumber">.+</span>| \
          ',file)

My actual code is longer than 4 lines, but this should get the point across. I'm trying to write this on separate lines so it is easier for me to read/debug, but as it stands now nothing is being stored into list.

I've tried moving the first few expressions onto one line and it works fine. What am I doing wrong?

avasal
user1956609
  • `list` is a builtin, use a different name for the variable – avasal Feb 27 '13 at 07:38
  • Two things: First, don't name a list `list`, it gets mixed up with the function `list()` ;P. Second, don't parse HTML with regex (look at the answer here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). A good module to use is `BeautifulSoup` – TerryA Feb 27 '13 at 07:39
  • If the HTML he intends to parse has a certain structure, it is a regular language and can be parsed with regex. – LtWorf Feb 27 '13 at 07:43
  • When you write your string on separate lines, do not indent the following lines; the indentation will be preserved in the string. Also avoid using names of builtins for your objects, like `list` and `file`. –  Feb 27 '13 at 07:43

1 Answer

Use multiline strings and make the regex verbose:

mylist = re.findall(r'''(?x)                    # verbose mode
                        <span\ info=".+</span>| # allows you to comment the regex
                        Name:\ .+<br>|          # for even better readability
                        <span\ id="Phone"\ info="phonenumber">.+</span>''', file)

You will have to escape the spaces, though, since whitespace is ignored in verbose regexes.
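To make the escaping rule concrete, here is a minimal sketch of how verbose mode treats spaces. The sample string and the variable names are made up for illustration, and it uses the `re.VERBOSE` flag instead of the inline `(?x)`, which should behave the same way here:

```python
import re

# Hypothetical sample input, made up for this illustration.
sample = 'Name: Alice<br>'

# Unescaped space: verbose mode drops it, so this is really 'Name:\w+<br>'.
unescaped = re.compile(r'Name: \w+<br>', re.VERBOSE)

# Escaped space: kept as a literal space.
escaped = re.compile(r'Name:\ \w+<br>', re.VERBOSE)

# A space inside a character class is also kept in verbose mode.
bracketed = re.compile(r'Name:[ ]\w+<br>', re.VERBOSE)

print(unescaped.findall(sample))  # no match, the space was ignored
print(escaped.findall(sample))    # matches the whole sample
print(bracketed.findall(sample))  # matches the whole sample
```

Whether you prefer `\ ` or `[ ]` is mostly a matter of taste; the character class can be easier to spot when skimming a long pattern.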

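Your solution failed because the whitespace introduced by the indentation became part of the regex, and since it wasn't a verbose regex, that whitespace is significant. The trailing backslashes don't help either: inside a raw string, a backslash followed by a newline is kept as those two characters rather than acting as a line continuation.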
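As a rough illustration of that failure mode, the sketch below builds a shortened two-alternative version of the pattern both ways. The sample `page` string is invented, and the second variant (adjacent string literals inside parentheses) is an alternative that is not part of the answer above, shown only as one possible way to split a non-verbose pattern across lines:

```python
import re

# Hypothetical sample input, made up for this illustration.
page = 'Name: Alice<br>\n<span id="Phone" info="phonenumber">555-0100</span>'

# In a raw string the backslash, the newline, and the indentation on the
# next line all stay in the pattern, so the second alternative only matches
# text that literally starts with a space, a newline, and ten spaces.
broken = r'Name: .+<br>| \
          <span id="Phone" info="phonenumber">.+</span>'

# Adjacent string literals are joined by the parser, so nothing extra
# ends up inside the pattern.
fixed = (r'Name: .+<br>|'
         r'<span id="Phone" info="phonenumber">.+</span>')

broken_matches = re.findall(broken, page)
fixed_matches = re.findall(fixed, page)

print(repr(broken))     # shows the stray backslash, newline and spaces
print(broken_matches)   # only the Name line is found
print(fixed_matches)    # both the Name line and the phone span are found
```

The verbose form keeps the ability to comment each alternative, while the concatenated form avoids having to escape every space, so which one reads better probably depends on the pattern.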

Tim Pietzcker