I'm currently working on a Python script that extracts selected data from a webpage.
For context, it looks up the phonetics of a word in an online dictionary, and does the same for a few similar words (similar to what Google's transliterator does). The problem is that each webpage has to be downloaded completely before I can extract the data I need (which unfortunately sits towards the end of the page source).
I wanted to know if there's any way to access a specific element of a webpage without downloading all of the data.
Here is the snippet of code that currently does this:
import requests
from bs4 import BeautifulSoup

for i in SuggestionJson['suggestions']:
    # download the whole webpage
    webpage = requests.get("https://www.vajehyab.com" + i['link'] + "&t=like")
    soup = BeautifulSoup(webpage.content, 'html.parser')
    # extract the phonetic text from the "wordbox" div
    phonetic = soup.find("div", {"id": "wordbox"}).section.header.h3.text.replace('/', '')
    if phonetic != '':  # save to file
        f.write(phonetic)
What I have in mind is for it to skip downloading elements like <head> and skip every other <div> element that doesn't match the id I want.
Is this possible?
Edit: For example, say I have the following HTML (from ifconfig.me):
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="content-style-type" content="text/css" />
<meta http-equiv="content-script-type" content="text/javascript" />
<meta http-equiv="content-language" content="en" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="cache-control" content="no-cache" />
<meta name="description" content="Get my IP Address" />
<meta name="keywords" content="ip address ifconfig ifconfig.me" />
<meta name="author" content="" />
<link rel="shortcut icon" href="favicon.ico" />
<link rel="canonical" href="https://ipinfo.io/">
<title>What Is My IP Address? - ifconfig.me</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="/styles/style.css" rel="stylesheet" type="text/css">
</head>
<body>
<div id="container" class="clearfix">
<div id="header">
<table>
<tr>
<td>
<h1><a href="http://ifconfig.me">What Is My IP Address? - ifconfig.me</a></h1>
</td>
<td></td>
</tr>
<tr>
<td></td>
<td>
<div id="plungins">
<div class="plungin" id="button_facebook">
<div id="fb-root"></div>
<script src="http://connect.facebook.net/en_US/all.js#xfbml=1"></script>
<fb:like href="http://ifconfig.me/" send="false" layout="button_count" width="100"
show_faces="true" font=""></fb:like>
</div>
<div class="plungin" id="button_twitter">
<a href="http://twitter.com/share" class="twitter-share-button"
data-url="http://ifconfig.me/" data-text="What Is My IP Address? - ifconfig.me
" data-count="horizontal"></a>
<script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script>
</div>
<div class="plungin" id="button_plusone">
<!-- Place this tag where you want the +1 button to render -->
<g:plusone size="medium" href="http://ifconfig.me/"></g:plusone>
<!-- Place this render call where appropriate -->
<script type="text/javascript">
(function () {
var po = document.createElement('script');
po.type = 'text/javascript';
po.async = true;
po.src = 'https://apis.google.com/js/plusone.js';
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(po, s);
})();
</script>
</div>
</div>
</td>
</tr>
</table>
</div>
<div id="info_area">
<h2>Your Connection</h2>
<table id="info_table" summary="info">
<tr>
<td class="info_table_label">IP Address</td>
<td id="ip_address_cell"><strong id="ip_address">2.177.115.178</strong></td>
</tr>
<tr>
<td class="info_table_label">Remote Host</td>
<td>unavailable</td>
</tr>
<tr>
<td class="info_table_label">User Agent</td>
<td>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</td>
</tr>
<tr>
<td class="info_table_label">Port</td>
<td>33966</td>
</tr>
<tr>
<td class="info_table_label">Language</td>
<td>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</td>
</tr>
<tr>
<td class="info_table_label">Referer</td>
<td></td>
</tr>
<tr>
<td class="info_table_label">Connection</td>
<td></td>
</tr>
<tr>
<td class="info_table_label">KeepAlive</td>
<td></td>
</tr>
<tr>
<td class="info_table_label">Method</td>
<td>GET</td>
</tr>
<tr>
<td class="info_table_label">Encoding</td>
<td>gzip, deflate, br</td>
</tr>
<tr>
<td class="info_table_label">MIME Type</td>
<td> text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
</td>
</tr>
<tr>
<td class="info_table_label">Charset</td>
<td></td>
</tr>
<tr>
<td class="info_table_label">Via</td>
<td>1.1 google</td>
</tr>
<tr>
<td class="info_table_label">X-Forwarded-For</td>
<td>2.177.115.178, 216.239.34.21</td>
</tr>
</table>
</div>
<!--<div id="middle"></div>-->
<div id="cli_wrap">
<h2>Command Line Interface</h2>
<table id="cli_table" summary="cli">
<tr>
<td class="cli_command">$ curl ifconfig.me</td>
<td class="cli_arrow">⇒</td>
<td>2.177.115.178</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/ip</td>
<td class="cli_arrow">⇒</td>
<td>2.177.115.178</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/host</td>
<td class="cli_arrow">⇒</td>
<td>unavailable</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/ua</td>
<td class="cli_arrow">⇒</td>
<td>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/port</td>
<td class="cli_arrow">⇒</td>
<td>33966</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/lang</td>
<td class="cli_arrow">⇒</td>
<td>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/keepalive</td>
<td class="cli_arrow">⇒</td>
<td></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/connection</td>
<td class="cli_arrow">⇒</td>
<td></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/encoding</td>
<td class="cli_arrow">⇒</td>
<td>gzip, deflate, br</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/mime</td>
<td class="cli_arrow">⇒</td>
<td>text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/charset</td>
<td class="cli_arrow">⇒</td>
<td></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/via</td>
<td class="cli_arrow">⇒</td>
<td>1.1 google</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/forwarded</td>
<td class="cli_arrow">⇒</td>
<td>2.177.115.178, 216.239.34.21</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/all</td>
<td class="cli_arrow">⇒</td>
<td>
ip_addr: 2.177.115.178
<br>
remote_host: unavailable
<br>
user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36
<br>
port: 33966
<br>
language: en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6
<br>
referer:
<br>
connection:
<br>
keep_alive:
<br>
method: GET
<br>
encoding: gzip, deflate, br
<br>
mime:
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
<br>
charset:
<br>
via: 1.1 google
<br>
forwarded: 2.177.115.178, 216.239.34.21
<br>
</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/all.xml</td>
<td class="cli_arrow">⇒</td>
<td><info>
<ip_addr>2.177.115.178</ip_addr>
<remote_host>unavailable</remote_host>
<user_agent>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</user_agent>
<port>33966</port>
<language>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</language>
<referer></referer>
<connection></connection>
<keep_alive></keep_alive>
<method>GET</method>
<encoding>gzip, deflate, br</encoding>
<mime>text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3</mime>
<charset></charset>
<via>1.1 google</via>
<forwarded>2.177.115.178, 216.239.34.21</forwarded>
</info></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/all.json</td>
<td class="cli_arrow">⇒</td>
<td>{"ip_addr":"2.177.115.178","remote_host":"unavailable","user_agent":"Mozilla/5.0
(X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.131
Chrome/74.0.3729.131
Safari/537.36","port":33966,"language":"en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6","method":"GET","encoding":"gzip,
deflate,
br","mime":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3","via":"1.1
google","forwarded":"2.177.115.178, 216.239.34.21"}</td>
</tr>
</table>
</div>
<div id="footer">© 2018 ifconfig.me</div>
</div>
</body>
</html>
I want the script to download only this part of the webpage (or at least get close to that goal):
<div id="cli_wrap">
<h2>Command Line Interface</h2>
<table id="cli_table" summary="cli">
<tr>
<td class="cli_command">$ curl ifconfig.me</td>
<td class="cli_arrow">⇒</td>
<td>2.177.115.178</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/ip</td>
<td class="cli_arrow">⇒</td>
<td>2.177.115.178</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/host</td>
<td class="cli_arrow">⇒</td>
<td>unavailable</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/ua</td>
<td class="cli_arrow">⇒</td>
<td>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/port</td>
<td class="cli_arrow">⇒</td>
<td>33966</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/lang</td>
<td class="cli_arrow">⇒</td>
<td>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/keepalive</td>
<td class="cli_arrow">⇒</td>
<td></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/connection</td>
<td class="cli_arrow">⇒</td>
<td></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/encoding</td>
<td class="cli_arrow">⇒</td>
<td>gzip, deflate, br</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/mime</td>
<td class="cli_arrow">⇒</td>
<td>text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/charset</td>
<td class="cli_arrow">⇒</td>
<td></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/via</td>
<td class="cli_arrow">⇒</td>
<td>1.1 google</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/forwarded</td>
<td class="cli_arrow">⇒</td>
<td>2.177.115.178, 216.239.34.21</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/all</td>
<td class="cli_arrow">⇒</td>
<td>
ip_addr: 2.177.115.178
<br>
remote_host: unavailable
<br>
user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36
<br>
port: 33966
<br>
language: en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6
<br>
referer:
<br>
connection:
<br>
keep_alive:
<br>
method: GET
<br>
encoding: gzip, deflate, br
<br>
mime:
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
<br>
charset:
<br>
via: 1.1 google
<br>
forwarded: 2.177.115.178, 216.239.34.21
<br>
</td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/all.xml</td>
<td class="cli_arrow">⇒</td>
<td><info>
<ip_addr>2.177.115.178</ip_addr>
<remote_host>unavailable</remote_host>
<user_agent>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</user_agent>
<port>33966</port>
<language>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</language>
<referer></referer>
<connection></connection>
<keep_alive></keep_alive>
<method>GET</method>
<encoding>gzip, deflate, br</encoding>
<mime>text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3</mime>
<charset></charset>
<via>1.1 google</via>
<forwarded>2.177.115.178, 216.239.34.21</forwarded>
</info></td>
</tr>
<tr>
<td class="cli_command">$ curl ifconfig.me/all.json</td>
<td class="cli_arrow">⇒</td>
<td>{"ip_addr":"2.177.115.178","remote_host":"unavailable","user_agent":"Mozilla/5.0
(X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.131
Chrome/74.0.3729.131
Safari/537.36","port":33966,"language":"en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6","method":"GET","encoding":"gzip,
deflate,
br","mime":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3","via":"1.1
google","forwarded":"2.177.115.178, 216.239.34.21"}</td>
</tr>
</table>
</div>
Edit2: The webpage I'm working with doesn't send a Content-Length header either.
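
To make the "at least get close to that goal" part concrete, here is a minimal sketch of the closest thing I can picture: stream the response and stop reading once the wanted element has arrived. It cannot skip what comes before the element (such as <head>), but it does avoid downloading what comes after it, and it doesn't rely on Content-Length. The names `read_until_section` and `fetch_section` are hypothetical helpers, and the markers are plain byte strings rather than real HTML parsing, so the end marker has to be the first closing tag after the start marker (nested <div> elements would cut the result short).

```python
from typing import Iterable, Optional


def read_until_section(chunks: Iterable[bytes], start: bytes, end: bytes) -> Optional[bytes]:
    """Consume chunks only until the bytes from `start` through `end` have arrived."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        begin = buffer.find(start)
        if begin == -1:
            # Keep just enough of the tail to catch a start marker split across chunks.
            buffer = buffer[-(len(start) - 1):] if len(start) > 1 else b""
            continue
        stop = buffer.find(end, begin + len(start))
        if stop != -1:
            # Returning here stops the iteration, so later chunks are never requested.
            return buffer[begin:stop + len(end)]
    return None  # markers not found before the stream ended


def fetch_section(url: str, start: bytes, end: bytes, chunk_size: int = 1024) -> Optional[bytes]:
    """Hypothetical wrapper: stream `url` with requests and stop early."""
    import requests  # third-party; imported here so the helper above stays stdlib-only

    # stream=True defers the body download; leaving the with-block closes the
    # connection, so the rest of the page is not read.
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()
        return read_until_section(response.iter_content(chunk_size=chunk_size), start, end)
```

For the ifconfig.me example above, the call would be something like `fetch_section("https://ifconfig.me", b'<div id="cli_wrap">', b"</div>")`, and the returned bytes could then be handed to BeautifulSoup as before. How much this actually saves depends on how far down the page the element sits and on how the server and any buffering in between deliver the chunks.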