import re

content='<tr><td style="text-align:center;" height="30">12090043</td>'+\
        '<td style="text-align:left;">CourseA</td>'+\
        '<td style="text-align:center;">3</td>'+\
        '<td style="text-align:left;">86</td><td>2013-Summer</td></tr>'+\
        '<tr><td style="text-align:center;" height="30">10420844</td>'+\
        '<td style="text-align:left;">CourseB</td>'+\
        '<td style="text-align:center;">4</td>'+\
        '<td style="text-align:left;">98</td><td>2013-Autumn</td></tr>'
pattern=re.compile('<tr>.*"30">(.*)</td>.*"text-align:left;">(.*)</td>.*"text-align:center;">(.*)</td>.*"text-align:left;">(.*)</td><td>(.*)</td></tr>')
items=re.findall(pattern,content)
print(items)

The output is:

[('10420844', 'CourseB', '4', '98', '2013-Autumn')]

But the expected result is:

[('12090043', 'CourseA', '3', '86', '2013-Summer'), ('10420844', 'CourseB', '4', '98', '2013-Autumn')]

This code only returns the last match whenever there is more than one match. Can anyone tell me why this is happening? Sorry for the long code, and thanks in advance!

Simon
  • 4
    No, don't use regex to parse HTML. – Remi Guan Feb 06 '16 at 08:43
  • 1
    Along with what Kevin said - [read this famous post](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – OneCricketeer Feb 06 '16 at 08:46
  • Thanks. Then what should I do to find all the matches? Actually I can convert HTML to str, so I'm still wondering what's wrong here. – Simon Feb 06 '16 at 08:48
  • 1
    Your HTML is a str in Python. There is no conversion to be done. You simply need to use a proper HTML parser as HTML is not a regular language, so regex is not the tool to use for it. – OneCricketeer Feb 06 '16 at 08:54
  • Either use an html parser, or one of the many excellent xml libraries available in python. If you work with one that allows xpath, you can probably even write very clean code that you can look at and understand instantly when you have to come back to it in 6 months (something that would never be the case with a regex solution). – Matthew Feb 06 '16 at 08:54

2 Answers

2

You can do this with BeautifulSoup like below:

>>> from bs4 import BeautifulSoup
>>> content = """
... <tr>
...     <td style="text-align:center;" height="30">12090043</td>
...     <td style="text-align:left;">CourseA</td>
...     <td style="text-align:center;">3</td>
...     <td style="text-align:left;">86</td><td>2013-Summer</td>
... </tr>
... 
... <tr>
...     <td style="text-align:center;" height="30">10420844</td>
...     <td style="text-align:left;">CourseB</td>
...     <td style="text-align:center;">4</td>
...     <td style="text-align:left;">98</td><td>2013-Autumn</td>
... </tr>
... """
>>> 
>>> soup = BeautifulSoup(content, "html.parser")
>>> [i.get_text(' ').split() for i in soup.find_all('tr')]
[['12090043', 'CourseA', '3', '86', '2013-Summer'], ['10420844', 'CourseB', '4', '98', '2013-Autumn']]

Regex isn't the right tool for parsing HTML. Rather than debugging the pattern, drop it entirely and use an HTML parser such as BeautifulSoup, as in the example above.
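
For anyone still wondering why the original pattern yields a single tuple: every `.*` in it is greedy, and the input is one long line, so the first `.*` after `<tr>` runs to the end of the string and backtracks only as far as the last `"30">`. The one match therefore spans from the first `<tr>` to the final `</tr>` and captures only the last row. The sketch below reproduces that and compares it with lazy quantifiers (`.*?`). It assumes the exact single-line markup from the question; the variable names are illustrative, and a parser is still the more robust choice for real pages.

```python
import re

# Same single-line input as in the question (adjacent literals are concatenated).
content = (
    '<tr><td style="text-align:center;" height="30">12090043</td>'
    '<td style="text-align:left;">CourseA</td>'
    '<td style="text-align:center;">3</td>'
    '<td style="text-align:left;">86</td><td>2013-Summer</td></tr>'
    '<tr><td style="text-align:center;" height="30">10420844</td>'
    '<td style="text-align:left;">CourseB</td>'
    '<td style="text-align:center;">4</td>'
    '<td style="text-align:left;">98</td><td>2013-Autumn</td></tr>'
)

# The question's pattern: every ".*" is greedy.
greedy = re.compile(
    r'<tr>.*"30">(.*)</td>.*"text-align:left;">(.*)</td>'
    r'.*"text-align:center;">(.*)</td>.*"text-align:left;">(.*)</td>'
    r'<td>(.*)</td></tr>'
)

# Same pattern with every ".*" turned into the lazy ".*?".
lazy = re.compile(greedy.pattern.replace('.*', '.*?'))

greedy_rows = greedy.findall(content)  # one tuple: the last row only
lazy_rows = lazy.findall(content)      # one tuple per row

print(greedy_rows)
print(lazy_rows)
```

The lazy version only works because every row here has exactly the same attributes in the same order; any change in the markup would break it, which is the practical argument for a parser.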

Remi Guan
  • Thank you so much. But why is there a letter "u" before each element? like [u'12090043',u'CourseA',u'3']. Thank you for your time!! – Simon Feb 06 '16 at 09:19
  • @Simon: I think you're using Python 2; the `u` prefix means the values are unicode strings. Please see: [What does the 'u' symbol mean in front of string values?](http://stackoverflow.com/questions/11279331/what-does-the-u-symbol-mean-in-front-of-string-values). Also, please remember to accept this answer if it's helpful. See also: [How does accepting an answer work?](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) – Remi Guan Feb 06 '16 at 09:23
  • Accepted. Thank you for your help. – Simon Feb 06 '16 at 09:30
1

Here is a solution using ElementTree

content="""
    <tr><td style="text-align:center;" height="30">12090043</td>
    <td style="text-align:left;">CourseA</td>
    <td style="text-align:center;">3</td>
    <td style="text-align:left;">86</td><td>2013-Summer</td></tr>
    <tr><td style="text-align:center;" height="30">10420844</td>
    <td style="text-align:left;">CourseB</td>
    <td style="text-align:center;">4</td>
    <td style="text-align:left;">98</td><td>2013-Autumn</td></tr>
"""

import xml.etree.ElementTree as ET
root = ET.fromstring("<table>%s</table>"%content)
items = [tuple(col.text for col in row.findall("./td")) for row in root.findall("./tr")]

Here, items will contain

[('12090043', 'CourseA', '3', '86', '2013-Summer'), ('10420844', 'CourseB', '4', '98', '2013-Autumn')]

As we need valid xml for this library, we need to wrap your content in an outer element, so we use <table>%s</table>. The name of this element really doesn't matter; I used table as your data looks like it comes from an html table. Anything could have been used, because we select the immediate child nodes (a different xpath expression may put restrictions on what we can use to avoid conflicts).
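
To illustrate that the wrapper name is arbitrary, here is a small sketch that parses the same rows under three different wrapper tags and compares the results. The helper `parse_rows`, the shortened two-column rows, and the wrapper name `root` are hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened rows (two columns, no attributes) to keep the example small.
rows = (
    '<tr><td>12090043</td><td>CourseA</td></tr>'
    '<tr><td>10420844</td><td>CourseB</td></tr>'
)

def parse_rows(fragment, wrapper):
    """Wrap the fragment in <wrapper>...</wrapper> and return one tuple per row."""
    root = ET.fromstring("<{0}>{1}</{0}>".format(wrapper, fragment))
    # "./tr" selects only direct children of the wrapper, so its name is irrelevant.
    return [tuple(td.text for td in tr.findall("./td"))
            for tr in root.findall("./tr")]

# Try an HTML tag, a non-HTML tag, and even "tr" itself as the wrapper.
results = {name: parse_rows(rows, name) for name in ("table", "root", "tr")}
print(results)
```

All three wrappers give the same list, because `./tr` matches only the immediate children of whatever the outer element happens to be.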

Once we have read the data into an ElementTree, we can use findall with the xpath expression ./tr, which finds all tr elements in the content. For each of these, we use ./td to find the td elements. The text attribute of these gets the content of them as text. The call to tuple is to match the OP's desired output which uses a tuple.

More powerful XML libraries exist (lxml, for instance), and ElementTree has only limited XPath support, but it is sufficient for this problem and has the advantage of being in the standard library.
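
One caveat worth sketching: because ElementTree is an XML parser, it rejects markup that is not well-formed, which is common in real-world HTML (an unclosed `<br>`, for example). The helper `try_parse` and both sample fragments below are made up for this illustration; returning `None` on failure is just one possible way to handle the error.

```python
import xml.etree.ElementTree as ET

def try_parse(fragment):
    """Return one tuple per row, or None if the fragment is not well-formed XML."""
    try:
        root = ET.fromstring("<table>%s</table>" % fragment)
    except ET.ParseError:
        # Raised for markup that an XML parser cannot accept.
        return None
    return [tuple(td.text for td in tr.findall("./td"))
            for tr in root.findall("./tr")]

well_formed = '<tr><td>12090043</td><td>CourseA</td></tr>'
not_well_formed = '<tr><td>12090043<br>CourseA</td></tr>'  # <br> is never closed

print(try_parse(well_formed))
print(try_parse(not_well_formed))
```

When the input may contain such markup, a tolerant HTML parser (such as the BeautifulSoup approach in the other answer) is likely the safer option.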

Matthew
  • Just curious. You say you used `table` to make valid XML. Wouldn't it make sense to use a tag name that isn't an HTML tag to not potentially conflict with anything? For example, would `%s` have worked? – OneCricketeer Feb 06 '16 at 09:13
  • As I said, I used it just because the input looked like an html table, but as stated, anything could have been used. In this case, because we are going to select the immediate child _tr_ nodes, I could have used anything there without a threat of a conflict (even _tr_). – Matthew Feb 06 '16 at 09:17
  • Gotcha. I've just rarely used Xpath or the ElementTree library, which is why I asked – OneCricketeer Feb 06 '16 at 09:19
  • Thanks a lot. I'll try this later. :) – Simon Feb 06 '16 at 09:23