OR statement and REGEX in Python

Question

I'm using regex to compile a list of strings from an HTML document in Python. The strings are found either inside a td tag or inside a div tag. I am having trouble using the regex OR properly to avoid the following problem. If I use:

import re

FindStrings = re.compile(r'<td>(.*?)</td>|padding:0;">(.*?)</div>')
MyStrings = re.findall(FindStrings, str(soup))
print(MyStrings)

I would get something like:

[('apple', ''), ('sky', ''), ('red', ''), ('', 'summer'), ('', 'pizza')]

I understand that the strings on the left side of each tuple are found with <td>(.*?)</td> and the ones on the right side with padding:0;">(.*?)</div>. What should be added to the regex to get a final list like the one below?

['apple', 'sky', 'red', 'summer', 'pizza']

score 4 · Answer 1 · edited May 23 '17 at 10:25

Do not use regex for parsing HTML. There are specialized tools for working with the HTML format.

Example using BeautifulSoup package:

from bs4 import BeautifulSoup

data = """
<body>
    <table>
        <tr>
            <td>apple</td>
            <td>sky</td>
        </tr>
        <tr>
            <td>red</td>
        </tr>
    </table>
    <div>summer</div>
    <div>pizza</div>
</body>
"""

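# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, 'html.parser')
print([item.text for item in soup.find_all(['td', 'div'])])

Prints:

['apple', 'sky', 'red', 'summer', 'pizza']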

score 2 · Accepted Answer · answered Oct 11 '14 at 21:04


Regardless of how you parse the HTML or build the regex, once you have the list of tuples you can flatten it with Python's itertools and drop the empty strings:

import itertools

item_list = [("apple", ""), ("sky", ""), ("red", ""), ("", "summer"), ("", "pizza")]
print(item_list)

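# Flatten the tuples into one list, then drop the empty strings
flat_list = list(itertools.chain.from_iterable(item_list))
# Wrap filter() in list() so the result prints as a list on Python 3 as well
result = list(filter(None, flat_list))
print(result)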

Output:

[('apple', ''), ('sky', ''), ('red', ''), ('', 'summer'), ('', 'pizza')]
['apple', 'sky', 'red', 'summer', 'pizza']

answered Oct 11 '14 at 21:04

l'L'l


Pfff. *This* got accepted? That could have been accomplished by a simple `[ ''.join(x) for x in item_list ]` without any `itertools` or whatever. – Alfe Oct 11 '14 at 21:10
Yes, it did get accepted — and itertools is faster. – l'L'l Oct 11 '14 at 21:20
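
For comparison, the one-liner from the first comment can be run on the same data. This is only a sketch: it relies on the assumption that each tuple holds exactly one non-empty string, which is true for the list in this answer but is not guaranteed for every regex. The name joined is made up for illustration.

```python
item_list = [("apple", ""), ("sky", ""), ("red", ""), ("", "summer"), ("", "pizza")]

# Each tuple holds one match and one empty string,
# so joining the pair leaves just the match.
joined = ["".join(pair) for pair in item_list]
print(joined)
```

If a tuple could ever contain two non-empty strings, the join would glue them together, whereas the flatten-and-filter approach above would keep them as separate items.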

score 0 · Answer 3 · answered Oct 11 '14 at 20:32


You can post-process the regex result into the form you want, for example:

# Result of the regex, stored in MyStrings
>>> MyStrings = [('apple', ''), ('sky', ''), ('red', ''), ('', 'summer'), ('', 'pizza')]
>>> [s[0] if s[1]=='' else s[1] for s in MyStrings]
['apple', 'sky', 'red', 'summer', 'pizza']

answered Oct 11 '14 at 20:32

Kamehameha


No, he wants to extract the proper results directly, via regex – DevLounge Oct 11 '14 at 20:34
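
A regex-only route, as the comment above asks for, might look like the sketch below. re.findall returns a flat list of strings when the pattern has exactly one capturing group, so the idea is to move the alternation into non-capturing groups (?:...) around a single (.*?). The sample HTML and the names html, pattern and my_strings are assumptions made up for illustration; the question does not show its real input.

```python
import re

# Hypothetical sample input; the real HTML from the question is not shown.
html = (
    '<table><tr><td>apple</td><td>sky</td></tr><tr><td>red</td></tr></table>'
    '<div style="padding:0;">summer</div>'
    '<div style="padding:0;">pizza</div>'
)

# Only one capturing group, so findall returns plain strings, not tuples.
# The opening and closing alternatives sit in non-capturing groups.
pattern = re.compile(r'(?:<td>|padding:0;">)(.*?)(?:</td>|</div>)')

my_strings = pattern.findall(html)
print(my_strings)
```

One trade-off: this pattern no longer pairs each opener with its own closer, so a <td> followed by a </div> would also match. For anything beyond simple, well-formed input, the parser-based answer above is the safer choice.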
