Conversion of a webpage to text

Question

I was trying to convert a webpage to text and save it in a text file. I wrote the following script in Python. It works, but the quality of the resulting text is not usable.

Is there any way to get a text file of much better quality, so that its meaning can actually be understood?

Here's my code:

import webbrowser
new = 2
#url = "http://google.com"
url = 'https://www.uniprot.org/uniprot/'
webbrowser.open(url, new = new)


from urllib.request import Request, urlopen

#url = "http://wikipedia.org"
#url = "https://www.uniprot.org/"
url = 'http://www.uniprot.org/'

req = Request(url, headers={'user-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
with open('<filepath>//web2txt_2.txt','wb') as out:
    out.write(webpage)
print(webpage)

The result I get is also given below, truncated for legibility.

b'<!DOCTYPE html SYSTEM "about:legacy-compat">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>UniProt</title><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/><meta content="width=device-width, initial-scale=1" name="viewport"/><link href="/" rel="home"/><link href="https://creativecommons.org/licenses/by/4.0/" rel="license"/><link type="image/vnd.microsoft.icon" href="/favicon.ico" rel="shortcut icon"/><link href="/uniprot.min.css2021_01" type="text/css" rel="stylesheet"/><link href="/tippy.css" type="text/css" rel="stylesheet"/><script type="text/javascript">\n\t\t\tvar BASE = \'/\';\n\t\t</script><script src="/js-compr.js2021_01" type="text/javascript"></script><script type="text/javascript">\n\t\t\t\tuniprot.isInternal = false;\n\t\t\t\tuniprot.namespace = \'uniprot\';\n\t\t\t\tuniprot.releasedate = \'2021_01\';\n\t\t\t</script><script type="text/javascript">\n\t\t\t;\n\t\t</script><meta content="0B002D36E9BAD5BA205A5DCAC3FD9E08" name="msvalidate.01"/><meta content="nositelinkssearchbox" name="google"/></head><body class="namespace-homepage" typeof="WebPage" prefix="up: http://purl.uniprot.org/core/" vocab="http://schema.org/"><span id="evidenceToolTip" style="display:none">&#xd;\n                                    &lt;p>An evidence describes the source of an annotation, e.g. 
an experiment that has been published in the scientific literature, an orthologous protein, a record from another database, etc.&lt;/p>&#xd;\n&#xd;\n&lt;p>&lt;a href="/manual/evidences">More...&lt;/a>&lt;/p>&#xd;\n                                </span><p style="display:none"><a accesskey="2" href="#content">Skip Header</a></p><div id="masthead-container"><div class="masthead" id="local-masthead"><div id="local-title"><a id="logo" accesskey="1" href="/"><img alt="" src="/images/logos/Logo_medium.png" title="UniProt home"/></a></div><div class="namespace-uniprot" id="local-search"><form method="get" action="/uniprot" id="search-form"><div id="namespace-background"><div class="searchBoxIndicator" style="display:none" id="searchBoxIndicator1">\xc2\xa0</div><div onclick="location.href=&apos;/help/text-search&apos;;" class="searchBoxIndicator" style="display:none" id="searchBoxIndicator2">\xc2\xa0</div><div onclick="location.href=&apos;/help/advanced_search &apos;;" class="searchBoxIndicator" style="display:none" id="searchBoxIndicator3">\xc2\xa0</div><a class="namespace-select" id="select-namespace" onclick="return false;" href=""><span class="caret_white" id="selected-namespace">UniProtKB</span></a><ul style="display:none" class="select-namespace-options"><a href="#" class="closeBox" id="closeNamespaceOptions">x</a><li><ul><li class="fixedHeight_namespaces"><h3 class="namespace_uniprot"><a class="namespace-option uniprot" href="#" id="uniprot">UniProtKB</a></h3><p>Protein knowledgebase</p></li><li class="fixedHeight_namespaces"><h3 class="namespace_uniparc"><a class="namespace-option uniparc" href="#" id="uniparc">UniParc</a></h3><p>Sequence archive</p></li><li class="fixedHeight_namespaces"><h3 class="namespace_help"><a class="namespace-option help" href="#" id="help">Help</a></h3><p>Help pages, FAQs, UniProtKB manual, documents, news archive and Biocuration projects.</p></li></ul></li><li><ul><li class="fixedHeight_namespaces"><h3 class="namespace_uniref"><a 
class="namespace-option uniref" href="#" id="uniref">UniRef</a></h3><p>Sequence clusters</p></li><li class="fixedHeight_namespaces"><h3 class="namespace_proteomes"><a class="namespace-option proteomes" href="#" id="proteomes">Proteomes</a></h3><p>Protein sets from fully sequenced genomes</p></li><li class="fixedHeight_namespaces"><h3 class="namespace_unirule">Annotation systems</h3>
<p class="supportingLeadingText">Systems used to automatically annotate proteins with high accuracy:</p><ul class="supportingDataOptions"><li><a class="namespace-option supporting" href="#" id="unirule">UniRule (Expertly curated rules)</a></li><li><a class="namespace-option supporting" href="#" id="arba">ARBA (System generated rules)</a></li></ul></li></ul></li><li><ul><li class="fixedHeight_namespaces"><h3 class="supporting"><span>Supporting data</span></h3><p class="supportingLeadingText">Select one of the options below to target your search:</p><ul class="supportingDataOptions"><li><a class="namespace-option supporting" href="#" id="citations">Literature citations</a></li><li><a class="namespace-option supporting" href="#" id="taxonomy">Taxonomy</a></li><li><a class="namespace-option supporting" href="#" id="keywords">Keywords</a></li><li><a class="namespace-option supporting" href="#" id="locations">Subcellular locations</a></li><li><a class="namespace-option supporting" href="#" id="database">Cross-referenced databases</a></li><li><a class="namespace-option supporting" href="#" id="diseases">Human diseases</a></li></ul></li></ul></li></ul><input value="" id="topQuery" type="hidden"/><div id="queryContainer"><input autocomplete="off" autofocus="autofocus" id="query" value="" accesskey="4" name="query" type="search"/></div><a class="caret_grey" href="#" id="advanced-search-toggle">Advanced</a><input value="score" name="sort" type="hidden"/><a id="search-button" title="Search" data-icon="1" class="icon icon-functional button">Search</a><div style="display:none" class="advSearch" id="query-builder-container">
...

Every UniProt entry has a text format meant for processing: just add '.txt' to the URL. That will give you a better start. You could also extract the schema.org markup instead, or use one of the other structured data formats such as RDF or XML. - Jerven, Feb 21 '21 at 09:06
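A minimal sketch of the '.txt' suggestion in the comment above, using the same urllib calls as the question. The helper names `entry_text_url` and `fetch_text` are hypothetical, the accession `P12345` is only a placeholder, and the entry URL layout is assumed from the base URL in the question, so check it against the live site before relying on it.

```python
from urllib.request import Request, urlopen


def entry_text_url(entry_url):
    """Build the plain-text URL for an entry page by appending '.txt'."""
    return entry_url.rstrip("/") + ".txt"


def fetch_text(url):
    """Download a URL and decode the body as UTF-8 text (needs network access)."""
    # Same User-Agent header as in the question's code.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")


# Placeholder accession and assumed URL layout, for illustration only.
url = entry_text_url("https://www.uniprot.org/uniprot/P12345")
print(url)

# Uncomment to actually download the entry:
# text = fetch_text(url)
```

Because the response is decoded to `str` before use, it can be written with `open(path, 'w', encoding='utf-8')` rather than in binary mode.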

score 0 · Accepted Answer · answered Feb 18 '21 at 03:30

Using the BeautifulSoup library together with the requests library will at least get you off to a good start. Once you have studied BeautifulSoup some more, you should be able to customize your program to extract only the text you want. See below.

Also, please see here for a similar question: Text Extracting: Used All Methods, Yet Stuck.

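import requests
from bs4 import BeautifulSoup

url = 'http://www.uniprot.org/'
content = requests.get(url)
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(content.text, 'html.parser')
# .text returns all the text in the document with the tags stripped.
print(soup.text)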

(Partial) output, hand-picked:

...
          The mission of UniProt is to provide
          the scientific community with a comprehensive, high-quality and
          freely accessible resource of protein sequence and functional
          information.
        UniProtKBUniProt KnowledgebaseSwiss-Prot (564,277)Manually annotated and reviewed.Records with 
...
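The run-together output above (for example `UniProtKBUniProt Knowledgebase`) comes from joining the text of adjacent tags with nothing in between. As one possible customization, here is a sketch that drops `script` and `style` elements and separates the remaining text fragments with newlines. The function name `visible_text` and the sample HTML are made up for illustration; on the real page you would pass in `content.text` from the snippet above.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of an HTML document, one fragment per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are code or styling, not page text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # separator keeps text from adjacent tags apart; strip trims each
    # fragment and discards the ones that are only whitespace.
    return soup.get_text(separator="\n", strip=True)


# Small made-up document standing in for the downloaded page.
sample = (
    "<html><head><title>UniProt</title>"
    "<script>var BASE = '/';</script></head>"
    "<body><h3>UniProtKB</h3><p>Protein knowledgebase</p></body></html>"
)
print(visible_text(sample))
```

The returned string can then be saved with `open(path, 'w', encoding='utf-8')`. Elements hidden through inline styles, such as the `display:none` blocks visible in the question's output, are not removed by this sketch and would need their own filter.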
