I have an HTML file (saved via curl, to avoid hitting the website repeatedly while I experiment) that contains dog listings. I am interested in the contents of the h3 tags, which hold each dog's name.

from bs4 import BeautifulSoup

# read from previously saved file (a local path, despite the variable name)
url = "petrescue_short.html"
with open(url) as page:
    soup = BeautifulSoup(page.read(), "html.parser")

# print all h3 tags; find_all returns a list! (not array)
h3_headers = soup.find_all(['h3'])
print('List all h3 header tags :', *h3_headers, sep='\n\n')

This prints:

<h3>
dog1
</h3>

<h3>
dog2
</h3>

...

However, I want to get rid of the tags, or at least of the newlines. Everything I tried ended in the error message TypeError: 'NoneType' object is not callable. I also read "How to modify list entries during for loop?", but the list shown there is actually an array.

I sort of understand that lists are not arrays, but isn't there a way to iterate through the list (which I can do) and, if I cannot change a list item, at least assign it to another variable and modify that?

I would have thought the following should work:

for i in range(len(h3_headers)):
    h3_item = h3_headers[i]
    h3_item = h3_item.replace('\n', '')
    print(h3_item, sep='\n')

How can I achieve the following:

<h3>dog1</h3>
<h3>dog2</h3>
<h3>...</h3>
MaxG
  • Can't you just loop over `h3_headers` and get the dog names using `.text`? Something like `dogs = [each.text for each in h3_headers]` – West Dec 10 '20 at 08:00

4 Answers

0

You can simply capture the tag contents with a regex; something like this would work:

>>> import re
>>> temp = """<h3>
... dog1
... </h3>
... 
... <h3>
... dog2
... </h3>"""
>>> temp = temp.replace("\n", "")
>>> re.findall(r'<h3>(.*?)</h3>', temp, re.MULTILINE)
['dog1', 'dog2']
>>> 
  • results in : `AttributeError: ResultSet object has no attribute 'replace'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?` – MaxG Dec 10 '20 at 09:40
  • ResultSet is a list, whereas I used temp as a string. If your temp is a list, you can join it first, e.g. "".join(str(t) for t in temp), and then call findall on the result – shubham tripathi Dec 14 '20 at 14:36
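As a small variation on the regex idea, the newline removal can be skipped by letting `.` match newlines. This is only a sketch: the `html` string is a hypothetical stand-in for the raw text of the saved file, and a regex like this is fragile against attributes or nested markup inside the h3 tags.

```python
import re

# Hypothetical sample standing in for the raw text of the saved HTML file
html = "<h3>\ndog1\n</h3>\n\n<h3>\ndog2\n</h3>"

# re.DOTALL lets '.' match newlines, so no prior replace() is needed;
# strip() removes the newlines that surround each captured name
names = [m.strip() for m in re.findall(r"<h3>(.*?)</h3>", html, re.DOTALL)]
print(names)
```

Note that this works on a string, not on the ResultSet returned by find_all.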
0

First, when you print your data with print('List all h3 header tags :', *h3_headers, sep='\n\n'), remove the sep='\n\n' argument from the print call.

ashhad ullah
  • :) That's not the issue. `sep='\n\n'` adds two newlines between the printed tags; it is not responsible for the newlines in the source. – MaxG Dec 10 '20 at 09:38
0

Strings in Python are immutable, so replace() returns a new string instead of changing the original. When you write a = values[1] and later reassign a, only the name a is rebound; values[1] stays as it was. Instead, assign the result back into the list directly. For example, assuming the list holds plain strings:

for i in range(len(h3_headers)):
    h3_item = h3_headers[i].replace('\n', '') ## a changed copy of h3_headers[i]
    h3_headers[i] = h3_item ## now the list is modified
    print(h3_item, sep='\n')

Output:

<h3>dog1</h3>
<h3>dog2</h3>
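The rebinding point can be checked in isolation. The following is a minimal sketch using a hypothetical list of plain strings (`items` and `cleaned` are illustrative names, not from the question):

```python
# Hypothetical stand-in for a list of plain strings
items = ["<h3>\ndog1\n</h3>", "<h3>\ndog2\n</h3>"]

for item in items:
    # replace() returns a new string; this only rebinds the loop variable
    item = item.replace("\n", "")

print(items)  # the list still holds the original strings

# Building a new list (or assigning by index) is what changes the stored values
cleaned = [item.replace("\n", "") for item in items]
print(cleaned)
```

A list comprehension is often the simpler choice when every element is transformed the same way.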
MennoK
  • hmm, I get `TypeError: 'NoneType' object is not callable` when trying this code. – MaxG Dec 10 '20 at 09:33
  • That's weird. Here is the full code, which I ran on Python 3.7.7 on a Windows machine:

    h3_headers = []
    h3_headers.append("""<h3>
    dog1
    </h3>""")
    h3_headers.append("""<h3>
    dog2
    </h3>""")
    for i in range(len(h3_headers)):
        h3_item = h3_headers[i].replace('\n', '')  ## a changed copy of h3_headers[i]
        h3_headers[i] = h3_item
        print(h3_item, sep='\n')

    – MennoK Dec 10 '20 at 10:05
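The two different outcomes in this thread most likely come down to element type: the test above uses plain strings, while find_all returns bs4 Tag objects. As far as I understand bs4, an unknown attribute such as `.replace` on a Tag is treated as a search for a child tag of that name, which yields None here and would explain the 'NoneType' object is not callable error. A minimal sketch, using a hypothetical inline sample instead of the saved file:

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample standing in for petrescue_short.html
html = "<h3>\ndog1\n</h3>\n\n<h3>\ndog2\n</h3>"
h3_headers = BeautifulSoup(html, "html.parser").find_all("h3")

# The elements are Tag objects, not strings
print(type(h3_headers[0]).__name__, isinstance(h3_headers[0], str))

# Converting each Tag to its markup string first makes str.replace() available
one_line = [str(h3).replace("\n", "") for h3 in h3_headers]
print(*one_line, sep="\n")
```

Converting with str() keeps the tags in the output, which matches the one-line form asked for in the question.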
0

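Adding .text in the second line did it. The items returned by find_all are Tag objects rather than strings, and .text gives a string that replace can be called on: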

for i in range(len(h3_headers)):
    h3_item = h3_headers[i].text
    h3_item = h3_item.replace('\n', '')
    print(h3_item, sep='\n')

thanks :)
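A more compact variant of the same idea, offered as a sketch rather than the definitive way: get_text(strip=True) removes the surrounding whitespace in one step, and the tags can be re-added by hand if the one-line form is wanted. The inline `html` string is a hypothetical stand-in for the saved file.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample standing in for petrescue_short.html
html = "<h3>\ndog1\n</h3>\n\n<h3>\ndog2\n</h3>"
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) returns the tag's text with surrounding whitespace removed
names = [h3.get_text(strip=True) for h3 in soup.find_all("h3")]
print(names)

# Rebuild the one-line form asked for in the question
for name in names:
    print(f"<h3>{name}</h3>")
```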

MaxG