I'm using a regex to compile a list of strings from an HTML document in Python. The strings are found either inside a td tag or inside a div tag. I'm having trouble using the regex OR operator (|) correctly to avoid the following problem. If I use:

FindStrings = re.compile(r'<td>(.*?)</td>|padding:0;">(.*?)</div>')
MyStrings = re.findall(FindStrings, str(soup))
print(MyStrings)

I would get something like:

[('apple', ''), ('sky', ''), ('red', ''), ('', 'summer'), ('', 'pizza')]

I understand that the strings in the first position of each tuple are found with <td>(.*?)</td> and the ones in the second position are found with padding:0;">(.*?)</div>. What should I change in the regex to get a final list like the one below?

['apple', 'sky', 'red', 'summer', 'pizza']
alecxe
LaGuille

3 Answers

Do not use regex for parsing HTML. There are specialized tools for working with the HTML format.

Example using the BeautifulSoup package:

from bs4 import BeautifulSoup

data = """
<body>
    <table>
        <tr>
            <td>apple</td>
            <td>sky</td>
        </tr>
        <tr>
            <td>red</td>
        </tr>
    </table>
    <div>summer</div>
    <div>pizza</div>
</body>
"""

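# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
print([item.text for item in soup.find_all(['td', 'div'])])

Prints:

['apple', 'sky', 'red', 'summer', 'pizza']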
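
If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch and not part of the original answer: the class name CellTextCollector and the helper extract_texts are made up for illustration, and it assumes the td and div elements are never nested inside one another.

```python
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collect the text inside <td> and <div> elements (hypothetical helper)."""

    TARGET_TAGS = {"td", "div"}

    def __init__(self):
        super().__init__()
        self.texts = []       # collected strings, in document order
        self._buffer = None   # list of text chunks while inside a target tag

    def handle_starttag(self, tag, attrs):
        # Start buffering text when a target element opens.
        if tag in self.TARGET_TAGS:
            self._buffer = []

    def handle_data(self, data):
        # Text outside a target element is ignored.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Store the buffered text when the target element closes.
        if tag in self.TARGET_TAGS and self._buffer is not None:
            self.texts.append("".join(self._buffer).strip())
            self._buffer = None


def extract_texts(html):
    """Return the text of every <td> and <div> found in the html string."""
    parser = CellTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


data = """
<body>
    <table>
        <tr><td>apple</td><td>sky</td></tr>
        <tr><td>red</td></tr>
    </table>
    <div>summer</div>
    <div>pizza</div>
</body>
"""

print(extract_texts(data))  # ['apple', 'sky', 'red', 'summer', 'pizza']
```

For real-world pages with nested or malformed markup, a full parser such as BeautifulSoup is likely to be the more robust choice.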
Community
alecxe

However you obtain the list of tuples (regex or otherwise), you can flatten it with Python's itertools and then drop the empty strings:

import itertools

item_list = [("apple", ""), ("sky", ""), ("red", ""), ("", "summer"), ("", "pizza")]
print(item_list)

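# Flatten the list of tuples into a single list of strings
flat_list = list(itertools.chain.from_iterable(item_list))
# filter(None, ...) drops the empty strings; list() is needed on Python 3,
# where filter returns an iterator rather than a list
result = list(filter(None, flat_list))
print(result)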

Output:

[('apple', ''), ('sky', ''), ('red', ''), ('', 'summer'), ('', 'pizza')]
['apple', 'sky', 'red', 'summer', 'pizza']
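
For pairs like these, where exactly one element of each tuple is non-empty, a few plain-Python variants give the same list. The sketch below only checks that they agree; the variable names are illustrative, and it makes no claim about which variant is fastest.

```python
import itertools

item_list = [("apple", ""), ("sky", ""), ("red", ""), ("", "summer"), ("", "pizza")]

# Variant 1: flatten with itertools, keeping only non-empty strings.
via_itertools = [s for s in itertools.chain.from_iterable(item_list) if s]

# Variant 2: the same flattening as a nested list comprehension.
via_comprehension = [s for pair in item_list for s in pair if s]

# Variant 3: join each pair; this relies on one element of each pair being "".
via_join = ["".join(pair) for pair in item_list]

print(via_itertools == via_comprehension == via_join)  # True
```

Note that the join variant would silently concatenate both strings if a tuple ever had two non-empty elements, whereas the first two variants would keep them as separate items.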
l'L'l
  • Pfff. *This* got accepted? That could have been accomplished by a simple `[ ''.join(x) for x in item_list ]` without any `itertools` or whatever. – Alfe Oct 11 '14 at 21:10
  • Yes, it did get accepted — and itertools is faster. – l'L'l Oct 11 '14 at 21:20

You can post-process the result of the regex into the form you want.
Something like this:

# Result of the regex, stored in MyStrings
>>> MyStrings = [('apple', ''), ('sky', ''), ('red', ''), ('', 'summer'), ('', 'pizza')]
>>> [s[0] if s[1]=='' else s[1] for s in MyStrings]
['apple', 'sky', 'red', 'summer', 'pizza']
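
Alternatively, the tuples can be avoided at the regex level: re.findall returns tuples only when the pattern has more than one capturing group, so moving the alternation into non-capturing groups (?:...) and keeping a single capturing group yields a flat list. A minimal sketch, assuming markup along the lines the question describes (the html string below is made up, since the original page is not shown):

```python
import re

# Hypothetical sample input; the real document is not shown in the question.
html = (
    '<td>apple</td><td>sky</td><td>red</td>'
    '<div style="padding:0;">summer</div>'
    '<div style="padding:0;">pizza</div>'
)

# One capturing group only: the alternatives for the opening and closing
# markers sit in non-capturing groups, so findall returns plain strings.
pattern = re.compile(r'(?:<td>|padding:0;">)(.*?)</(?:td|div)>')

print(pattern.findall(html))  # ['apple', 'sky', 'red', 'summer', 'pizza']
```

This pattern does not check that an opening td is closed by a matching td, so it carries the usual fragility of regex-based HTML matching.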
Kamehameha