For some reason, all of a sudden BeautifulSoup is not able to locate the content of any of my tags in a new Python script that I've begun. I have been using BeautifulSoup for about a year now, and have never encountered this problem.

I am able to load a JSON payload in Python with ".json()", pass it to BeautifulSoup using html.parser, and it works every time.

I am now trying to read a MySQL field that contains raw HTML, feed it into Python as a text string, and parse and manipulate it with BeautifulSoup, without any success.

I have gone down to simply loading a text string, as in this example, with the same negative result: I cannot find a tag based on a text-string search (BeautifulSoup always returns "None").

from bs4 import BeautifulSoup

text_field = '<td><p></p><p></p><td><p>HELP text here 1<a href="some_URL_here"><ac:image ac:align="center" ac:layout="center" ac:original-height="153" ac:original-width="200"><ri:attachment ri:filename="image.png" ri:version-at-save="1"></ri:attachment></ac:image></a></p></td><p /><h2 style="text-align: center;"><a href="{some_URL_here}"><em><strong>Click here&hellip;</strong></em></a></h2></td>'
soup = BeautifulSoup(text_field, 'html.parser')
print(soup)
print(soup.prettify())

test = soup.find('td', text="HELP")
print(test)

The "prettify" output shows that BeautifulSoup parses the markup properly:

<td>
    <p>
    </p>
    <p>
    </p>
    <td>
        <p>
            HELP text here 1
            <a href="some_URL_here">
                <ac:image ac:align="center" ac:layout="center" ac:original-height="153" ac:original-width="200">
                    <ri:attachment ri:filename="image.png" ri:version-at-save="1">
                    </ri:attachment>
                </ac:image>
            </a>
        </p>
    </td>
    <p>
    </p>
    <h2 style="text-align: center;">
        <a href="{some_URL_here}">
            <em>
                <strong>
                    Click here…
                </strong>
            </em>
        </a>
    </h2>
</td>

But no matter what I try, BeautifulSoup is ALWAYS returning "None" from any find request.

Am I missing something obvious here?

1 Answer

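I figured out that find() does not match on part of a string: a plain string passed to text= must equal the entire string. So instead of:

test = soup.find('td', text="HELP")

you have to declare the entire string. Because a tag name combined with text= is checked against that tag's .string, which is None when the tag has more than one child (as the <p> here does), the dependable form is to search for the string itself and then step up to the enclosing tag:

test = soup.find(text="HELP text here 1").find_parent('td')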
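To make the exact-match behaviour concrete, here is a minimal sketch on a trimmed version of the question's markup. The trimmed html string and the variable names are assumptions made for illustration, and string= is used because it is the newer name for the text= argument. It also shows one wrinkle: when a tag name is passed as well, the comparison is made against that tag's .string, which is None for a tag with more than one child.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the question's markup (an assumption for illustration)
html = '<td><p>HELP text here 1<a href="some_URL_here">link</a></p></td>'
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, so a fragment finds nothing
partial = soup.find(string="HELP")

# The complete string matches and returns the text node itself
exact = soup.find(string="HELP text here 1")

# With a tag name, the tag's .string is compared; <p> has two children
# (the text plus <a>), so its .string is None and the lookup fails
via_tag = soup.find("p", string="HELP text here 1")

# Going from the matched text node up to the enclosing tag works
cell = exact.find_parent("td")

print(partial, repr(exact), via_tag, cell.name)
```

So a None result can come from two different causes: the string is only a fragment, or the tag being searched has no single .string to compare against.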

To search for a partial string, I found the answer through regex trial and error, together with the following posts:

Beautiful Soup Find Tags based on partial attribute value

python's re: return True if string contains regex pattern
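As a rough illustration of the partial-match idea applied to the text search from the question: a compiled pattern passed as string= is searched within each text node, so a fragment is enough. The html string and the variable names below are assumptions, not the original data.

```python
import re
from bs4 import BeautifulSoup

# Two sibling cells; only the first contains the fragment (illustrative markup)
html = ('<td><p>HELP text here 1<a href="some_URL_here">link</a></p></td>'
        '<td><p>other</p></td>')
soup = BeautifulSoup(html, "html.parser")

# The pattern is searched inside each text node, so "HELP" alone matches
node = soup.find(string=re.compile("HELP"))

# The result is the text node; find_parent() climbs to the tag of interest
cell = node.find_parent("td")

# A function filter is another option: keep every <td> whose combined
# text contains the fragment, regardless of how many children it has
cells = soup.find_all(lambda tag: tag.name == "td" and "HELP" in tag.get_text())

print(repr(node), cell.name, len(cells))
```

The function filter is the more forgiving of the two when the text is split across several child tags, since get_text() joins all of it before the check.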

So the solution looks like this:

Real INPUT sample (Python)

INPUT = '<tbody><tr><th colspan="2"><h3><strong>TITLE 1</strong></h3></th></tr><tr><td><p><strong>TITLE 2</strong></p></td><th><p><strong>File and documentation repository</strong></p></th></tr><tr><td><ac:image ac:align="center" ac:layout="center" ac:original-height="912" ac:original-width="1502"><ri:attachment ri:filename="Sample_Diagram.jpg" ri:version-at-save="1"></ri:attachment></ac:image></td></tr></tbody>'

... and here is the Python script:

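import re
from bs4 import BeautifulSoup

REPLACEMENT_TAG = '<ri:attachment ri:filename="new_filename.png" ri:version-at-save="1"></ri:attachment>'

soup = BeautifulSoup(INPUT, "html.parser")

# Find the tag whose ri:filename attribute contains the string (dot escaped so it matches literally)
EXTRACTED = soup.find("ri:attachment", {"ri:filename": re.compile(r"Sample_Diagram\.jpg")})
# Parse the replacement into a tag first; a plain str would be inserted as escaped text
NEW_TAG = BeautifulSoup(REPLACEMENT_TAG, "html.parser").find("ri:attachment")
EXTRACTED.replace_with(NEW_TAG)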

This Python code will:

  • identify TAG (i.e. "ri:attachment")
  • based on partial string (i.e. "Sample_Diagram.jpg")
  • and replace with new TAG
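
The three steps above can be sketched end to end as follows. This is a minimal sketch under a few assumptions: markup is a shortened stand-in for INPUT, the variable names are hypothetical, and the replacement is parsed into a tag before insertion, because a plain str handed to replace_with() is inserted as text and its angle brackets are escaped on output.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for INPUT (an assumption for illustration)
markup = ('<td><ac:image ac:align="center"><ri:attachment '
          'ri:filename="Sample_Diagram.jpg" ri:version-at-save="1">'
          '</ri:attachment></ac:image></td>')
replacement = ('<ri:attachment ri:filename="new_filename.png" '
               'ri:version-at-save="1"></ri:attachment>')

soup = BeautifulSoup(markup, "html.parser")

# Steps 1 and 2: find the tag by name and by a partial match on its attribute
old_tag = soup.find("ri:attachment", {"ri:filename": re.compile(r"Sample_Diagram")})

# Step 3: parse the replacement into a real tag, then swap it in
new_tag = BeautifulSoup(replacement, "html.parser").find("ri:attachment")
old_tag.replace_with(new_tag)

result = str(soup)
print(result)
```

If only the filename changes, assigning old_tag["ri:filename"] = "new_filename.png" is a simpler alternative that leaves the tag in place.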