Python String Extract between

Question

I am scraping a website using Python and BeautifulSoup. I was able to find all the td elements on the page with the command:

data = soup.find_all('td')

Then I find the first individual td that I need to use:

td = data[19]

If I print this td the output is:

<td data-geoid="0617568" data-isnumeric="1" data-srcnote="true" data-value="18.8">
<span data-title="Culver City city, California"></span><div class="qf-sourcenote">
<span></span><a title="Source: 2018 American Community Survey (ACS), 5-year estimates. Estimates are not comparable to other geographic levels due to methodology differences that may exist between different data sources."></a>
</div>18.8%</td>

Now I want to extract the data that is between the end of the div and the end of the td, so the 18.8%. I used this post to try to extract it with the following code:

m = re.search('</div>(.+?)</td>', td)

This gives me the following error:

Traceback (most recent call last):
  File "/Users/Alfie/PycharmProjects/474scrape/srape.py", line 18, in <module>
    m = re.search('</div>(.+?)</td>', td)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 183, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object

I think the problem is with escape characters, or something similar, in the markers I am using. Any help is appreciated.

score 1 · Accepted Answer · answered Apr 23 '20 at 20:57

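td is not of type str. soup.find_all returns BeautifulSoup Tag objects, and re.search accepts only a string or bytes-like object, which is what the TypeError is reporting.

If td were of type str, the code would work just fine: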

import re

td = """
<td data-geoid="0617568" data-isnumeric="1" data-srcnote="true" data-value="18.8">
<span data-title="Culver City city, California"></span><div class="qf-sourcenote">
<span></span><a title="Source: 2018 American Community Survey (ACS), 5-year estimates. Estimates are not comparable to other geographic levels due to methodology differences that may exist between different data sources."></a>
</div>18.8%</td>
"""

m = re.search(r'</div>(.+?)</td>', td)
print(m.group(1))
# 18.8%

Try replacing

m = re.search(r'</div>(.+?)</td>', td)

with

m = re.search(r'</div>(.+?)</td>', str(td))
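
To see why the str() call matters, here is a minimal sketch that reproduces the error without BeautifulSoup installed. FakeTag is a hypothetical stand-in, not part of bs4; it only mimics the relevant property of the object in td, namely that it is not a str but renders to markup when passed to str(). The shortened markup is also an assumption made for brevity.

```python
import re


class FakeTag:
    """Hypothetical stand-in for a parsed tag: not a str, but str() yields markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


td = FakeTag('<td data-value="18.8"><div class="qf-sourcenote"></div>18.8%</td>')
pattern = r'</div>(.+?)</td>'

# Passing the object itself fails, because re.search needs str or bytes.
try:
    re.search(pattern, td)
    raised = False
except TypeError:
    raised = True

# Converting to str first gives re.search something it can scan.
m = re.search(pattern, str(td))
value = m.group(1)

print(raised)
print(value)
```

The first call raises the same TypeError as in the question; the second succeeds because the regex sees plain text.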

score 1 · Answer 2 · answered Apr 23 '20 at 20:59

Try passing the pattern as a raw string.

m = re.search(r'</div>(.+?)</td>', td)

If this doesn't work (the pattern contains no backslashes, so the r prefix alone will not change the result), check the type of td. If it is not a string, convert it with str() before passing it to re.search.
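
The check-then-convert advice can be sketched as a small helper. The name extract_between and its parameters are hypothetical, chosen for illustration. It assumes the input is either a str or an object whose str() form is the markup. re.escape is used so the markers are matched literally, which sidesteps the escape-character worry in the question, and re.DOTALL lets the captured text span line breaks.

```python
import re


def extract_between(obj, start, end):
    """Return the text between the start and end markers, or None if not found."""
    # Convert non-string objects (for example a parsed tag) to their text form.
    text = obj if isinstance(obj, str) else str(obj)
    # Escape the markers so any regex metacharacters in them are taken literally.
    pattern = re.escape(start) + r'(.+?)' + re.escape(end)
    m = re.search(pattern, text, flags=re.DOTALL)
    return m.group(1) if m else None


html = '<td>\n<div class="qf-sourcenote">\n</div>18.8%</td>'
result = extract_between(html, '</div>', '</td>')
missing = extract_between(html, '<p>', '</p>')

print(result)
print(missing)
```

Returning None when the markers are absent avoids the AttributeError that m.group(1) would raise on a failed match.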

Devansh Soni
