I am trying to convert a webpage to text and save it in a text file. I wrote the following Python script. It runs, but what gets saved is the raw HTML markup, not readable text.
Is there a way to produce a cleaner text file in which the page's actual content is readable?
Here's my code:
import webbrowser
new = 2
#url = "http://google.com"
url = 'https://www.uniprot.org/uniprot/'
webbrowser.open(url, new = new)
from urllib.request import Request, urlopen
#url = "http://wikipedia.org"
#url = "https://www.uniprot.org/"
url = 'http://www.uniprot.org/'
req = Request(url, headers={'user-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
with open('<filepath>//web2txt_2.txt','wb') as out:
    out.write(webpage)
print(webpage)
The result I get is also given below, truncated for legibility.
b'<!DOCTYPE html SYSTEM "about:legacy-compat">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>UniProt</title><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/><meta content="width=device-width, initial-scale=1" name="viewport"/><link href="/" rel="home"/><link href="https://creativecommons.org/licenses/by/4.0/" rel="license"/><link type="image/vnd.microsoft.icon" href="/favicon.ico" rel="shortcut icon"/><link href="/uniprot.min.css2021_01" type="text/css" rel="stylesheet"/><link href="/tippy.css" type="text/css" rel="stylesheet"/><script type="text/javascript">\n\t\t\tvar BASE = \'/\';\n\t\t</script><script src="/js-compr.js2021_01" type="text/javascript"></script><script type="text/javascript">\n\t\t\t\tuniprot.isInternal = false;\n\t\t\t\tuniprot.namespace = \'uniprot\';\n\t\t\t\tuniprot.releasedate = \'2021_01\';\n\t\t\t</script><script type="text/javascript">\n\t\t\t;\n\t\t</script><meta content="0B002D36E9BAD5BA205A5DCAC3FD9E08" name="msvalidate.01"/><meta content="nositelinkssearchbox" name="google"/></head><body class="namespace-homepage" typeof="WebPage" prefix="up: http://purl.uniprot.org/core/" vocab="http://schema.org/"><span id="evidenceToolTip" style="display:none">
\n <p>An evidence describes the source of an annotation, e.g. an experiment that has been published in the scientific literature, an orthologous protein, a record from another database, etc.</p>
\n
\n<p><a href="/manual/evidences">More...</a></p>
\n </span><p style="display:none"><a accesskey="2" href="#content">Skip Header</a></p><div id="masthead-container"><div class="masthead" id="local-masthead"><div id="local-title"><a id="logo" accesskey="1" href="/"><img alt="" src="/images/logos/Logo_medium.png" title="UniProt home"/></a></div><div class="namespace-uniprot" id="local-search"><form method="get" action="/uniprot" id="search-form"><div id="namespace-background"><div class="searchBoxIndicator" style="display:none" id="searchBoxIndicator1">\xc2\xa0</div><div onclick="location.href='/help/text-search';" class="searchBoxIndicator" style="display:none" id="searchBoxIndicator2">\xc2\xa0</div><div onclick="location.href='/help/advanced_search ';" class="searchBoxIndicator" style="display:none" id="searchBoxIndicator3">\xc2\xa0</div><a class="namespace-select" id="select-namespace" onclick="return false;" href=""><span class="caret_white" id="selected-namespace">UniProtKB</span></a><ul style="display:none" class="select-namespace-options"><a href="#" class="closeBox" id="closeNamespaceOptions">x</a><li><ul><li class="fixedHeight_namespaces"><h3 class="namespace_uniprot"><a class="namespace-option uniprot" href="#" id="uniprot">UniProtKB</a></h3><p>Protein knowledgebase</p></li><li class="fixedHeight_namespaces"><h3 class="namespace_uniparc"><a class="namespace-option uniparc" href="#" id="uniparc">UniParc</a></h3><p>Sequence archive</p></li><li class="fixedHeight_namespaces"><h3 class="namespace_help"><a class="namespace-option help" href="#" id="help">Help</a></h3><p>Help pages, FAQs, UniProtKB manual, documents, news archive and Biocuration projects.</p></li></ul></li><li><ul><li class="fixedHeight_namespaces"><h3 class="namespace_uniref"><a class="namespace-option uniref" href="#" id="uniref">UniRef</a></h3><p>Sequence clusters</p></li><li class="fixedHeight_namespaces"><h3 class="namespace_proteomes"><a class="namespace-option proteomes" href="#" id="proteomes">Proteomes</a></h3><p>Protein sets from fully 
sequenced genomes</p></li><li class="fixedHeight_namespaces"><h3 class="namespace_unirule">Annotation systems</h3>
<p class="supportingLeadingText">Systems used to automatically annotate proteins with high accuracy:</p><ul class="supportingDataOptions"><li><a class="namespace-option supporting" href="#" id="unirule">UniRule (Expertly curated rules)</a></li><li><a class="namespace-option supporting" href="#" id="arba">ARBA (System generated rules)</a></li></ul></li></ul></li><li><ul><li class="fixedHeight_namespaces"><h3 class="supporting"><span>Supporting data</span></h3><p class="supportingLeadingText">Select one of the options below to target your search:</p><ul class="supportingDataOptions"><li><a class="namespace-option supporting" href="#" id="citations">Literature citations</a></li><li><a class="namespace-option supporting" href="#" id="taxonomy">Taxonomy</a></li><li><a class="namespace-option supporting" href="#" id="keywords">Keywords</a></li><li><a class="namespace-option supporting" href="#" id="locations">Subcellular locations</a></li><li><a class="namespace-option supporting" href="#" id="database">Cross-referenced databases</a></li><li><a class="namespace-option supporting" href="#" id="diseases">Human diseases</a></li></ul></li></ul></li></ul><input value="" id="topQuery" type="hidden"/><div id="queryContainer"><input autocomplete="off" autofocus="autofocus" id="query" value="" accesskey="4" name="query" type="search"/></div><a class="caret_grey" href="#" id="advanced-search-toggle">Advanced</a><input value="score" name="sort" type="hidden"/><a id="search-button" title="Search" data-icon="1" class="icon icon-functional button">Search</a><div style="display:none" class="advSearch" id="query-builder-container">
...
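For reference, here is a minimal sketch of one possible way to reduce such output to readable text, using only the standard library's `html.parser`. The names `TextExtractor` and `html_to_text` are hypothetical helpers made up for this illustration, and the sample bytes are a shortened stand-in for the real page. It assumes the page is UTF-8 encoded (as the `charset=UTF-8` meta tag in the output suggests) and simply drops tags plus the contents of `<script>` and `<style>` elements. It does not hide elements marked `display:none`, so some tooltip text would still come through.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(raw_bytes, encoding="utf-8"):
    """Decode the downloaded bytes and return the text with markup removed."""
    parser = TextExtractor()
    parser.feed(raw_bytes.decode(encoding, errors="replace"))
    parser.close()
    return parser.get_text()


# Shortened stand-in for the bytes returned by urlopen(req).read()
sample = (
    b"<html><head><title>UniProt</title>"
    b"<script>var BASE = '/';</script></head>"
    b"<body><p>Protein knowledgebase</p></body></html>"
)
print(html_to_text(sample))
```

In the script above, the same function could be applied to `webpage`, with the result written to a file opened in text mode (`'w'`, `encoding='utf-8'`) instead of `'wb'`. Third-party HTML parsers are another common choice for this, but the sketch stays with the standard library to avoid extra dependencies.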