
I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.

The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to scrape the first text (the one wrapped in one of two specific links).

I found this SO question (Find specific link w/ beautifulsoup) and tried to implement variations of that, but wasn't able to get it to work.

Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly over the course of each page I'm scraping:

<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a>

The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.

The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.

This is always preceded by one of these two links:

forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=15 (Male)

I've tested a whole bunch of different things, including things like:

gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19')

print(gender_containers.get_text())

But for everything I've tried, I keep getting errors like:

ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.

What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)

<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a>

...are scraped as two separate variables so that I can store them as two items in a list and then iterate down to the next instance of this code, scrape those two text snippets and store them as another list, etc. I'm building a list of lists in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.

(I already have working code that scrapes and stores the longer string; I just haven't been able to get the gender part to work.)

QHarr
custerc

2 Answers


Try the following code.

from bs4 import BeautifulSoup
data='''<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a>'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)

Output:

[女孩]
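
Note that taking the text of the `em` element keeps the surrounding brackets. If only the link text is wanted, one option is to select the `a` nested inside the `em` instead. This is a minimal sketch that reuses the snippet from the question; the variable names `em_text` and `gender` are just illustrative.

```python
from bs4 import BeautifulSoup

data = '''<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a>'''
soup = BeautifulSoup(data, 'html.parser')

# Text of the whole <em>, brackets included
em_text = soup.select_one('em').text

# Text of the <a> inside the <em>, without the brackets
gender = soup.select_one('em a').text

print(em_text)
print(gender)
```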
KunduK

It sounds like you could use an attribute = value CSS selector with the $ (ends with) operator.

If there can only be one occurrence per page:

soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text 

This assumes that typeid=19 or typeid=15 only occurs at the end of the href values of interest. The "," between the two parts of the selector allows a match on either.
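
One reason an ends-with match is convenient here: the parser decodes `&amp;` inside attribute values, so the stored href contains a plain `&`. That is probably why an exact match against the raw string from the page source finds nothing. A small sketch to check this, using the snippet from the question (the names `raw_href`, `link` and `exact_matches` are illustrative):

```python
from bs4 import BeautifulSoup

html = '''<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em>'''
soup = BeautifulSoup(html, 'html.parser')

# The href exactly as it appears in the page source, entities included
raw_href = 'forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19'

# Ends-with match on either typeid
link = soup.select_one("[href$='typeid=19'], [href$='typeid=15']")
print(link['href'])              # the decoded attribute value
print(link['href'] == raw_href)  # compare with the raw source string

# Exact match on the raw (undecoded) string
exact_matches = soup.find_all('a', href=raw_href)
print(len(exact_matches))
```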

You could additionally handle possibility of not being present as follows:

from bs4 import BeautifulSoup

html = '''<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a>'''
soup = BeautifulSoup(html, 'html.parser')
# Run the selector once and reuse the result
match = soup.select_one("[href$='typeid=19'], [href$='typeid=15']")
gender = match.text if match is not None else 'Not found'
print(gender)

Multiple values:

genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
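
To build the list of lists described in the question (gender plus title per row), one possible approach is to loop over the title links and look back to the `em` that precedes each one. The sketch below rests on assumptions: the `div` wrappers and the second row (its href and title `示例标题`) are made up for illustration, and it assumes each `em` and its title link are siblings on the real page, which should be checked against the actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample; only the first row comes from the question.
html = '''
<div><em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a></div>
<div><em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=15">男孩</a>]</em> <a href="thread-000000-1-1.html" onclick="atarget(this)" class="s xst">示例标题</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
selector = "[href$='typeid=19'], [href$='typeid=15']"

rows = []
for title_link in soup.select('a.xst'):
    # Nearest preceding <em> sibling of the title link (assumed layout)
    em = title_link.find_previous_sibling('em')
    gender_link = em.select_one(selector) if em is not None else None
    gender = gender_link.text if gender_link is not None else 'Not found'
    rows.append([gender, title_link.get_text(strip=True)])

for row in rows:
    print(row)
```

Starting from the title link keeps each pair together, so a row with no gender link gets 'Not found' rather than shifting the two lists out of step.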
QHarr
  • This works, but how could I get it to find that more than once? There are multiple instances of that code on the page. Here's what I have now, but it's just pulling the first example it finds rather than all ```python link_containers = soup.find_all('a', class_ = 's xst') # isolate the forum title links gender_containers = soup.find_all('em') for link in gender_containers: gender = soup.find_all("[href$='typeid=19'], [href$='typeid=15']").text for link in link_containers: link_text = link.get_text() testlist1.append([link_text, gender]) ``` – custerc May 06 '19 at 02:28
  • do you need to group them with other info or just return a list of them? I've shown returning a list above. Can update further of course. – QHarr May 06 '19 at 04:30