
I am trying to make the code get only the text between the `<p>` tags, but I haven't found a way yet.

I've tried a simple loop. In this program you are supposed to enter a URL, and when you run it, it shows the plain text.

    import urllib.request
    import urllib.parse
    import re

    print("Enter the URL")
    url = input()

    #url = "https://en.wikipedia.org/wiki/Somalia"
    values = {'s':'basic', 'submit':'search'}
    data = urllib.parse.urlencode(values)
    data = data.encode('utf-8')
    req = urllib.request.Request(url,data)
    resp = urllib.request.urlopen(req)
    respData = resp.read().decode('utf-8')  # decode bytes to str before matching

    #print(respData)

    paragraphs = re.findall(r'<p>(.*?)</p>', respData)

    for eachP in paragraphs:
        print(eachP)

I have also tried to use BeautifulSoup, but I haven't even managed to import it.

  • Possible duplicate of [How to find all text inside `<p>` elements in an HTML page using BeautifulSoup](https://stackoverflow.com/questions/10113702/how-to-find-all-text-inside-p-elements-in-an-html-page-using-beautifulsoup) – Aaron_ab Jan 24 '19 at 07:55
  • [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?noredirect=1&lq=1) – Rahul Jan 24 '19 at 07:57
  • Regular expressions are NOT the way to parse XML or HTML, even if they happen to work for simple cases. It is strongly advised to use a proper parser. – guidot Jan 24 '19 at 07:59
  • Is there any simpler way besides XML? – Jan 24 '19 at 08:05

1 Answer

Welcome to SO and to programming. You can't parse [X]HTML with regex, so it's time to use libraries: Beautiful Soup and requests are your best friends here.

In your bash/cmd/terminal, type:

    pip install requests
    pip install beautifulsoup4

Then use:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("https://en.wikipedia.org/wiki/Somalia")
    # Specify the parser explicitly; 'html.parser' ships with Python
    soup = BeautifulSoup(r.text, 'html.parser')
    for p in soup.find_all('p'):
        print(p.text)
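Since the asker couldn't get Beautiful Soup installed, here is a standard-library-only sketch using the built-in `html.parser` module. The subclass name `ParagraphExtractor` is made up for illustration; the `HTMLParser` hooks it overrides are standard Python:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element."""
    def __init__(self):
        super().__init__()
        self.in_p = False     # are we currently inside a <p>?
        self.paragraphs = []  # finished paragraph texts
        self._buf = []        # text fragments of the current <p>

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self.in_p = False
            self.paragraphs.append(''.join(self._buf).strip())

    def handle_data(self, data):
        if self.in_p:
            self._buf.append(data)

# Demo on a small snippet; for a real page, feed it
# urllib.request.urlopen(url).read().decode('utf-8') instead.
parser = ParagraphExtractor()
parser.feed("<div><p>Hello <b>world</b>!</p><p>Second paragraph.</p></div>")
print(parser.paragraphs)  # ['Hello world!', 'Second paragraph.']
```

This avoids third-party packages entirely, at the cost of more code than the Beautiful Soup one-liner.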
– Rahul