I'm working on a Python script that extracts specific data from a webpage.

For context, it looks up a word's phonetics in an online dictionary, and does the same for a few similar words (similar to what Google's transliterator does). The problem is that each webpage has to be downloaded completely before I can extract the data I need (which unfortunately is near the end of the page source).

Is there any way to access a specific element of a webpage without downloading all of the data?

Here is my snippet of code that currently does this:

import requests
from bs4 import BeautifulSoup

# SuggestionJson and f (an open output file) are assumed to be defined earlier in the script
for i in SuggestionJson['suggestions']:
    webpage = requests.get("https://www.vajehyab.com" + i['link'] + "&t=like")  # download whole webpage
    soup = BeautifulSoup(webpage.content, 'html.parser')
    phonetic = soup.find("div", {"id": "wordbox"}).section.header.h3.text.replace('/', '')  # extract data from div
    if phonetic != '':  # save to file
        f.write(phonetic)

What I have in mind is for it to skip downloading elements like <head>, and to skip every other <div> element that doesn't match the id I want. Is this possible?

Edit: For example, say I have the following HTML (from ifconfig.me):

<!DOCTYPE html>
<html lang="en">

<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="content-style-type" content="text/css" />
    <meta http-equiv="content-script-type" content="text/javascript" />
    <meta http-equiv="content-language" content="en" />
    <meta http-equiv="pragma" content="no-cache" />
    <meta http-equiv="cache-control" content="no-cache" />
    <meta name="description" content="Get my IP Address" />
    <meta name="keywords" content="ip address ifconfig ifconfig.me" />
    <meta name="author" content="" />
    <link rel="shortcut icon" href="favicon.ico" />
    <link rel="canonical" href="https://ipinfo.io/">
    <title>What Is My IP Address? - ifconfig.me</title>
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link href="/styles/style.css" rel="stylesheet" type="text/css">
</head>

<body>
    <div id="container" class="clearfix">
        <div id="header">
            <table>
                <tr>
                    <td>
                        <h1><a href="http://ifconfig.me">What Is My IP Address? - ifconfig.me</a></h1>
                    </td>
                    <td></td>
                </tr>
                <tr>
                    <td></td>
                    <td>
                        <div id="plungins">
                            <div class="plungin" id="button_facebook">
                                <div id="fb-root"></div>
                                <script src="http://connect.facebook.net/en_US/all.js#xfbml=1"></script>
                                <fb:like href="http://ifconfig.me/" send="false" layout="button_count" width="100"
                                    show_faces="true" font=""></fb:like>
                            </div>

                            <div class="plungin" id="button_twitter">
                                <a href="http://twitter.com/share" class="twitter-share-button"
                                    data-url="http://ifconfig.me/" data-text="What Is My IP Address? - ifconfig.me
           " data-count="horizontal"></a>
                                <script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script>
                            </div>

                            <div class="plungin" id="button_plusone">
                                <!-- Place this tag where you want the +1 button to render -->
                                <g:plusone size="medium" href="http://ifconfig.me/"></g:plusone>
                                <!-- Place this render call where appropriate -->
                                <script type="text/javascript">
                                    (function () {
                                        var po = document.createElement('script');
                                        po.type = 'text/javascript';
                                        po.async = true;
                                        po.src = 'https://apis.google.com/js/plusone.js';
                                        var s = document.getElementsByTagName('script')[0];
                                        s.parentNode.insertBefore(po, s);
                                    })();
                                </script>
                            </div>
                        </div>
                    </td>
                </tr>
            </table>
        </div>
        <div id="info_area">
            <h2>Your Connection</h2>
            <table id="info_table" summary="info">
                <tr>
                    <td class="info_table_label">IP Address</td>
                    <td id="ip_address_cell"><strong id="ip_address">2.177.115.178</strong></td>
                </tr>
                <tr>
                    <td class="info_table_label">Remote Host</td>
                    <td>unavailable</td>
                </tr>
                <tr>
                    <td class="info_table_label">User Agent</td>
                    <td>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
                        Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</td>
                </tr>
                <tr>
                    <td class="info_table_label">Port</td>
                    <td>33966</td>
                </tr>
                <tr>
                    <td class="info_table_label">Language</td>
                    <td>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</td>
                </tr>
                <tr>
                    <td class="info_table_label">Referer</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="info_table_label">Connection</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="info_table_label">KeepAlive</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="info_table_label">Method</td>
                    <td>GET</td>
                </tr>
                <tr>
                    <td class="info_table_label">Encoding</td>
                    <td>gzip, deflate, br</td>
                </tr>
                <tr>
                    <td class="info_table_label">MIME Type</td>
                    <td> text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
                    </td>
                </tr>
                <tr>
                    <td class="info_table_label">Charset</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="info_table_label">Via</td>
                    <td>1.1 google</td>
                </tr>
                <tr>
                    <td class="info_table_label">X-Forwarded-For</td>
                    <td>2.177.115.178, 216.239.34.21</td>
                </tr>
            </table>
        </div>
        <!--<div id="middle"></div>-->
        <div id="cli_wrap">
            <h2>Command Line Interface</h2>
            <table id="cli_table" summary="cli">
                <tr>
                    <td class="cli_command">$ curl ifconfig.me</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>2.177.115.178</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/ip</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>2.177.115.178</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/host</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>unavailable</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/ua</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
                        Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/port</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>33966</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/lang</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/keepalive</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/connection</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/encoding</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>gzip, deflate, br</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/mime</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
                    </td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/charset</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td></td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/via</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>1.1 google</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/forwarded</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>2.177.115.178, 216.239.34.21</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/all</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>

                        ip_addr: 2.177.115.178
                        <br>

                        remote_host: unavailable
                        <br>

                        user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
                        Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36
                        <br>

                        port: 33966
                        <br>

                        language: en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6
                        <br>

                        referer:
                        <br>

                        connection:
                        <br>

                        keep_alive:
                        <br>

                        method: GET
                        <br>

                        encoding: gzip, deflate, br
                        <br>

                        mime:
                        text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
                        <br>

                        charset:
                        <br>

                        via: 1.1 google
                        <br>

                        forwarded: 2.177.115.178, 216.239.34.21
                        <br>

                    </td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/all.xml</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>&lt;info&gt;
                        &lt;ip_addr&gt;2.177.115.178&lt;/ip_addr&gt;
                        &lt;remote_host&gt;unavailable&lt;/remote_host&gt;
                        &lt;user_agent&gt;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
                        Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36&lt;/user_agent&gt;
                        &lt;port&gt;33966&lt;/port&gt;
                        &lt;language&gt;en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6&lt;/language&gt;
                        &lt;referer&gt;&lt;/referer&gt;
                        &lt;connection&gt;&lt;/connection&gt;
                        &lt;keep_alive&gt;&lt;/keep_alive&gt;
                        &lt;method&gt;GET&lt;/method&gt;
                        &lt;encoding&gt;gzip, deflate, br&lt;/encoding&gt;
                        &lt;mime&gt;text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3&lt;/mime&gt;
                        &lt;charset&gt;&lt;/charset&gt;
                        &lt;via&gt;1.1 google&lt;/via&gt;
                        &lt;forwarded&gt;2.177.115.178, 216.239.34.21&lt;/forwarded&gt;
                        &lt;/info&gt;</td>
                </tr>
                <tr>
                    <td class="cli_command">$ curl ifconfig.me/all.json</td>
                    <td class="cli_arrow">&rArr;</td>
                    <td>{&quot;ip_addr&quot;:&quot;2.177.115.178&quot;,&quot;remote_host&quot;:&quot;unavailable&quot;,&quot;user_agent&quot;:&quot;Mozilla/5.0
                        (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.131
                        Chrome/74.0.3729.131
                        Safari/537.36&quot;,&quot;port&quot;:33966,&quot;language&quot;:&quot;en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6&quot;,&quot;method&quot;:&quot;GET&quot;,&quot;encoding&quot;:&quot;gzip,
                        deflate,
                        br&quot;,&quot;mime&quot;:&quot;text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3&quot;,&quot;via&quot;:&quot;1.1
                        google&quot;,&quot;forwarded&quot;:&quot;2.177.115.178, 216.239.34.21&quot;}</td>
                </tr>
            </table>
        </div>
        <div id="footer">&copy; 2018 ifconfig.me</div>
    </div>
</body>

</html>

I want the script to download only this part of the webpage (or at least get close to that goal):

<div id="cli_wrap">
    <h2>Command Line Interface</h2>
    <table id="cli_table" summary="cli">
        <tr>
            <td class="cli_command">$ curl ifconfig.me</td>
            <td class="cli_arrow">&rArr;</td>
            <td>2.177.115.178</td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/ip</td>
            <td class="cli_arrow">&rArr;</td>
            <td>2.177.115.178</td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/host</td>
            <td class="cli_arrow">&rArr;</td>
            <td>unavailable</td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/ua</td>
            <td class="cli_arrow">&rArr;</td>
            <td>Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
                Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36</td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/port</td>
            <td class="cli_arrow">&rArr;</td>
            <td>33966</td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/lang</td>
            <td class="cli_arrow">&rArr;</td>
            <td>en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6</td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/keepalive</td>
            <td class="cli_arrow">&rArr;</td>
            <td></td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/connection</td>
            <td class="cli_arrow">&rArr;</td>
            <td></td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/encoding</td>
            <td class="cli_arrow">&rArr;</td>
            <td>gzip, deflate, br</td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/mime</td>
            <td class="cli_arrow">&rArr;</td>
            <td>text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
            </td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/charset</td>
            <td class="cli_arrow">&rArr;</td>
            <td></td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/via</td>
            <td class="cli_arrow">&rArr;</td>
            <td>1.1 google</td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/forwarded</td>
            <td class="cli_arrow">&rArr;</td>
            <td>2.177.115.178, 216.239.34.21</td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/all</td>
            <td class="cli_arrow">&rArr;</td>
            <td>

                ip_addr: 2.177.115.178
                <br>

                remote_host: unavailable
                <br>

                user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
                Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36
                <br>

                port: 33966
                <br>

                language: en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6
                <br>

                referer:
                <br>

                connection:
                <br>

                keep_alive:
                <br>

                method: GET
                <br>

                encoding: gzip, deflate, br
                <br>

                mime:
                text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
                <br>

                charset:
                <br>

                via: 1.1 google
                <br>

                forwarded: 2.177.115.178, 216.239.34.21
                <br>

            </td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/all.xml</td>
            <td class="cli_arrow">&rArr;</td>
            <td>&lt;info&gt;
                &lt;ip_addr&gt;2.177.115.178&lt;/ip_addr&gt;
                &lt;remote_host&gt;unavailable&lt;/remote_host&gt;
                &lt;user_agent&gt;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap
                Chromium/74.0.3729.131 Chrome/74.0.3729.131 Safari/537.36&lt;/user_agent&gt;
                &lt;port&gt;33966&lt;/port&gt;
                &lt;language&gt;en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6&lt;/language&gt;
                &lt;referer&gt;&lt;/referer&gt;
                &lt;connection&gt;&lt;/connection&gt;
                &lt;keep_alive&gt;&lt;/keep_alive&gt;
                &lt;method&gt;GET&lt;/method&gt;
                &lt;encoding&gt;gzip, deflate, br&lt;/encoding&gt;
                &lt;mime&gt;text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3&lt;/mime&gt;
                &lt;charset&gt;&lt;/charset&gt;
                &lt;via&gt;1.1 google&lt;/via&gt;
                &lt;forwarded&gt;2.177.115.178, 216.239.34.21&lt;/forwarded&gt;
                &lt;/info&gt;</td>
        </tr>
        <tr>
            <td class="cli_command">$ curl ifconfig.me/all.json</td>
            <td class="cli_arrow">&rArr;</td>
            <td>{&quot;ip_addr&quot;:&quot;2.177.115.178&quot;,&quot;remote_host&quot;:&quot;unavailable&quot;,&quot;user_agent&quot;:&quot;Mozilla/5.0
                (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.131
                Chrome/74.0.3729.131
                Safari/537.36&quot;,&quot;port&quot;:33966,&quot;language&quot;:&quot;en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6&quot;,&quot;method&quot;:&quot;GET&quot;,&quot;encoding&quot;:&quot;gzip,
                deflate,
                br&quot;,&quot;mime&quot;:&quot;text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3&quot;,&quot;via&quot;:&quot;1.1
                google&quot;,&quot;forwarded&quot;:&quot;2.177.115.178, 216.239.34.21&quot;}</td>
        </tr>
    </table>
</div>

Edit 2: The webpage I'm working with doesn't send a Content-Length header either.

Xosrov

1 Answer

Not exactly the way you envision it. You want to instruct the web server to skip parts of its content based on tags, and while that is theoretically possible, it is not something regular web pages support. (If it were some kind of API, maybe, but you are scraping regular web pages.)

There is something you can look into, though: HTTP range requests. Instead of asking for a full file, you ask for a range of it. If you know that the web page is about 100 kilobytes, for instance, and that the tags you are looking for are somewhere in the last 3 kilobytes, you can ask the web server to send you only the last 3 kilobytes.

Depending on how the web server and the software behind it are set up, that can work, and it can be done with Python's requests library. If the page is dynamically generated, though, the web server will typically not honor your range request and will send you the full page instead.
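As a rough sketch of the range-request idea, the snippet below asks a server for only the last few kilobytes of a page. It uses the standard library's urllib.request so that it runs without extra packages; with requests, the same header can be passed as headers={"Range": "bytes=-3000"}. The function names, the 10-second timeout, and the opener parameter are illustrative assumptions, not part of any library. A 206 status means the server honored the range; a 200 means it ignored the header and sent the whole page.

```python
import urllib.request


def build_tail_request(url, last_bytes):
    """Build a GET request asking for only the final `last_bytes` bytes."""
    # "bytes=-N" is a suffix range: the last N bytes of the resource.
    return urllib.request.Request(url, headers={"Range": f"bytes=-{last_bytes}"})


def fetch_tail(url, last_bytes, opener=urllib.request.urlopen):
    """Return (body, honored), where honored is True on 206 Partial Content."""
    request = build_tail_request(url, last_bytes)
    with opener(request, timeout=10) as response:
        body = response.read()
        # 206 = partial content was sent; 200 = range ignored, full page sent.
        honored = response.status == 206
    return body, honored
```

Calling fetch_tail(page_url, 3000), with page_url standing in for the page you are scraping, returns the bytes received and whether the server actually sent a partial response. A suffix range does not require knowing the page length in advance, but the server is still free to ignore it.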

(If that works, it's not certain that BeautifulSoup can make sense of the fragmented HTML you will get. But it's possible; it's very tolerant!)
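To illustrate what parsing such a fragment might look like, here is a minimal sketch that pulls the text out of an element by id from HTML that starts mid-document. It deliberately uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages; the class name, helper name, and list of void tags are assumptions made for the example, and it relies on the target element's own tags being properly closed.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first element that has a given id."""

    # Tags without a closing tag; they must not change the nesting depth.
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth inside the target; 0 means outside
        self.done = False   # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in self.VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        # Stray closing tags before the target (depth 0) are simply ignored.
        if self.depth and tag not in self.VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(fragment, target_id):
    """Return the whitespace-normalized text of the element with `target_id`."""
    parser = TextById(target_id)
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

The leading garbage that a byte range typically produces (a tag cut in half, closing tags with no opener) is treated as plain text or ignored, so the target element can still be found as long as it lies entirely inside the downloaded range.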

Prof. Falken
  • Very helpful solution! The website I'm working with has a constant format where this could work, but I was thinking of a more general solution. Is there any way a "special" web request could be sent to get this info? (Sorry, I'm not well informed on this matter.) – Xosrov May 07 '19 at 10:16
  • I just tested this and unfortunately it doesn't work. I guess the webpage doesn't support it? Is there any other way? – Xosrov May 08 '19 at 12:02
  • @Alireza You can see from the headers you get back in the response whether it supports it. I don't think there is another way. You can't force the server to respond in a way it's not programmed to respond. Consider what the server does: it builds a response dynamically as a complete HTML document, probably with some kind of framework or a PHP page or something. You want it to build only a fraction of a web page. Besides the server code needing a redesign, it is possibly just as much work for the backend to produce half a page as a full page (with database connections or whatever). – Prof. Falken May 08 '19 at 14:11
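
The header check mentioned in the last comment could be sketched as follows. A server that supports range requests usually advertises it with an Accept-Ranges: bytes response header, while Accept-Ranges: none says it does not. A missing header is not conclusive; only sending a Range request and getting a 206 back settles it. The function name is hypothetical, and it accepts any mapping of header names to values, such as response.headers from requests.

```python
def advertises_byte_ranges(headers):
    """Return True if the response headers advertise byte-range support."""
    # Header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    units = normalized.get("accept-ranges", "")
    # The value is a comma-separated list of range units, e.g. "bytes" or "none".
    return "bytes" in [unit.strip().lower() for unit in units.split(",")]
```

For example, advertises_byte_ranges(response.headers) on a response from the dictionary site would show whether a range request is even worth trying there.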