I'm using a regex to compile a list of strings from an HTML document in Python. The strings are found either inside a td tag or inside a div tag. I'm having trouble using the regex OR (|) correctly to avoid the following problem. If I use:
import re

FindStrings = re.compile(r'<td>(.*?)</td>|padding:0;">(.*?)</div>')
MyStrings = re.findall(FindStrings, str(soup))
print(MyStrings)
I would get something like:
[('apple', ''), ('sky', ''), ('red', ''), ('', 'summer'), ('', 'pizza')]
I understand that the strings on the left side of each tuple are matched by <td>(.*?)</td> and the ones on the right side by padding:0;">(.*?)</div>. I would like to know what should be added to the regex to get a final list like the one below:
['apple', 'sky', 'red', 'summer', 'pizza']
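
For reference, here is a minimal, self-contained sketch that reproduces the behaviour and shows two possible ways to reach the flat list. The html string is made-up sample markup standing in for str(soup), and the names pairs, flat, single and flat2 are only illustrative. re.findall returns one tuple per match with one slot per capture group, so the group that did not take part in the match shows up as an empty string.

```python
import re

# Hypothetical sample markup standing in for str(soup)
html = (
    '<td>apple</td><td>sky</td><td>red</td>'
    '<div style="padding:0;">summer</div>'
    '<div style="padding:0;">pizza</div>'
)

pattern = re.compile(r'<td>(.*?)</td>|padding:0;">(.*?)</div>')

# Two capture groups -> findall yields 2-tuples; the unused group is ''
pairs = pattern.findall(html)

# Option 1: keep the original regex and pick the non-empty group of each tuple
flat = [a or b for a, b in pairs]

# Option 2: use a single capture group, with non-capturing (?:...) groups
# around the alternatives, so findall returns plain strings
single = re.compile(r'(?:<td>|padding:0;">)(.*?)(?:</td>|</div>)')
flat2 = single.findall(html)

print(flat)
print(flat2)
```

Option 2 is slightly looser than the original pattern, since it would also accept a <td> closed by </div>. If that matters, Option 1 keeps the two alternatives strictly separate.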