
I am trying to make the code get only the text between the `<p>` tags, but I haven't found a way yet.

I've tried a simple loop. In this program you are supposed to enter a URL, and when you run it, it shows the plain text.

    import urllib.request
    import urllib.parse
    import re

    print("Enter the URL")
    url = input()

    #url = "https://en.wikipedia.org/wiki/Somalia"
    values = {'s':'basic', 'submit':'search'}
    data = urllib.parse.urlencode(values)
    data = data.encode('utf-8')
    req = urllib.request.Request(url,data)
    resp = urllib.request.urlopen(req)
    respData = resp.read().decode('utf-8')  # decode bytes to str before matching

    #print(respData)

    paragraphs = re.findall(r'<p>(.*?)</p>', respData)

    for eachP in paragraphs:
        print(eachP)

I have also tried to use BeautifulSoup, but I haven't even managed to import it.

  • Possible duplicate of [How to find all text inside `<p>` elements in an HTML page using BeautifulSoup](https://stackoverflow.com/questions/10113702/how-to-find-all-text-inside-p-elements-in-an-html-page-using-beautifulsoup) – Aaron_ab Jan 24 '19 at 07:55
  • [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?noredirect=1&lq=1) – Rahul Jan 24 '19 at 07:57
  • Regular expressions are NOT the way to parse XML or HTML, even if they happen to work for simple cases. It is strongly advised to use a proper parser. – guidot Jan 24 '19 at 07:59
  • Is there any simpler way besides XML? – Jan 24 '19 at 08:05

1 Answer

Welcome to SO and to programming. You can't parse [X]HTML with regex, so it's time to use libraries: Beautiful Soup and requests are your best friends here.

In your bash/cmd/terminal, type:

    pip install requests
    pip install beautifulsoup4

Then use:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("https://en.wikipedia.org/wiki/Somalia")
    # Specify the parser explicitly; 'html.parser' ships with Python
    soup = BeautifulSoup(r.text, 'html.parser')
    for p in soup.find_all('p'):
        print(p.text)
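Since the asker couldn't get Beautiful Soup installed, here is a standard-library-only sketch using the built-in `html.parser` module. The subclass name `ParagraphExtractor` is made up for illustration; the `HTMLParser` hooks it overrides are standard Python:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element."""
    def __init__(self):
        super().__init__()
        self.in_p = False     # are we currently inside a <p>?
        self.paragraphs = []  # finished paragraph texts
        self._buf = []        # text fragments of the current <p>

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self.in_p = False
            self.paragraphs.append(''.join(self._buf).strip())

    def handle_data(self, data):
        if self.in_p:
            self._buf.append(data)

# Demo on a small snippet; for a real page, feed it
# urllib.request.urlopen(url).read().decode('utf-8') instead.
parser = ParagraphExtractor()
parser.feed("<div><p>Hello <b>world</b>!</p><p>Second paragraph.</p></div>")
print(parser.paragraphs)  # ['Hello world!', 'Second paragraph.']
```

This avoids third-party packages entirely, at the cost of more code than the Beautiful Soup one-liner.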
– Rahul