I have JSON values that I need stripped of all HTML tags.

After using the following function:

import json
from urllib.request import urlopen

def payloaded():
    # Fetch the payload and decode it with the charset from the response headers
    with urlopen("http://www.example.com/payload.json") as r:
        data = json.loads(r.read().decode(r.headers.get_content_charset("utf-8")))
    text = data["body"]["und"][0]["value"]
    return text

This is the returned (text):

&lt;div class=&quot;blah&quot;&gt;'<p>This is the text.</p>\r\n'

This is the original (text):

<div class="blah"><p>This is the text.</p>

Note: I need all HTML tags stripped, and there are no real guidelines about which tags I will be getting.

This is what I want the (text) to be:

This is the text.

This is the post function I am using:

import json
import requests

def add_node_basic(text):
    url = "http://www.example.com"
    headers = {"content-type": "application/json"}
    # x (auth token) and y (document id) are placeholders from the original post
    payload = {
        "auth_token": x,
        "docs": {
            "id": y,
            "fields": [
                {"name": "body", "value": text, "type": "text"},
            ],
        },
    }

    r = requests.post(url, data=json.dumps(payload), headers=headers)

Any suggestions on how to achieve this are much appreciated!

avorter
  • What is the input json file? – Paul Rooney May 13 '16 at 05:34
  • Hopefully this is what you are looking for, @PaulRooney: `def add_node_basic(text) url = "www.example.com" headers = {"content-type": "application/json"} payload = { "auth_token": x, "docs": { "id": y, "fields": [ {"name": "body", "value": text, "type": "text"}, ]} } r = requests.post(url, data=json.dumps(payload), headers=headers)` You may be looking for this: the original text `This is the text. ...` – avorter May 13 '16 at 05:41
  • I went ahead and added it to the main section. Apologies for the confusion. – avorter May 13 '16 at 05:53

1 Answer

You can try slicing the string using the find method, like this:

>>> print(text[text.find('<p>') + len('<p>'):text.find('</p>')])
This is the text.

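If you are trying to extract only the text from the HTML source, you can use the HTMLParser class from Python's standard library (html.parser in Python 3, the HTMLParser module in Python 2). Example:

from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Python 3 needs the base initializer; it also calls reset()
        super().__init__()
        self.fed = []
    def handle_data(self, d):
        # Called for every run of text between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # process any input the parser is still buffering
    return s.get_data()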
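
The returned text in the question mixes entity-escaped markup (`&lt;div ...&gt;`) with literal tags, so a parser on its own would keep the escaped `<div ...>` as visible text. Below is a minimal sketch, assuming Python 3, that unescapes the string first and then strips the tags. The names `strip_html` and `_TextCollector` are hypothetical, and the sample string is a simplified version of the returned text shown in the question.

```python
from html import unescape
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        self.parts.append(data)


def strip_html(raw):
    """Return the text of `raw` with all tags removed (hypothetical helper)."""
    collector = _TextCollector()
    # Unescape first so that entity-escaped tags such as &lt;div&gt;
    # become real tags and are dropped by the parser.
    collector.feed(unescape(raw))
    collector.close()  # process any input the parser is still buffering
    return "".join(collector.parts).strip()


# Simplified version of the returned text from the question
sample = "&lt;div class=&quot;blah&quot;&gt;<p>This is the text.</p>\r\n"
print(strip_html(sample))  # prints: This is the text.
```

Unescaping before parsing is an assumption about this particular payload. If escaped angle brackets can be legitimate content in other pages, that step may need to be skipped or applied more selectively.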
Sameer Mirji
  • I should have mentioned that there are various other "tags" that are not similar throughout the 1000+ pages I will be running this through. I would need a way to strip all html leaving me with only the good text. This seems like it might work for some one off use cases though. – avorter May 13 '16 at 05:45
  • In that case, your question is a possible duplicate of [this](http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python). {Hint: Use [HTMLParser](https://docs.python.org/2/library/htmlparser.html)} Updated it in my answer as well. – Sameer Mirji May 13 '16 at 05:59