I am trying to scrape webpages with Jsoup. Jsoup doesn't seem to capture all the <input> elements that Chrome shows.

It is missing values such as these:

<input type="hidden" id="fileId" value="3168935269">
<input type="hidden" id="secondsLeft" value="20">

Using Jsoup I extracted these elements:

<input type="hidden" class="jsItemDirId" value="yRg1N-QP" />

<input type="hidden" class="jsItemFileId" value="i-EbooI0" />

<input type="hidden" id="fbAppId" value="255519317820035" />

<input type="hidden" id="sPrefix" value="http://search.4shared.com" />

<input type="hidden" class="sLink file" value="/q/CCAD/1" />

<input type="hidden" class="sLink video" value="/q/CCQD/1/video" />

<input type="hidden" class="sLink music" value="/q/CCQD/1/music" />

<input type="hidden" class="sLink photo" value="/q/CCQD/1/photo" />

<input type="hidden" class="sLink games" value="/q/CCQD/1/game" />

<input type="hidden" class="sLink book" value="/q/CCQD/1/books_office" />

<input type="hidden" class="sLink featured_videos" value="/q/CCQD/1/video" />

<input type="hidden" id="sBreadcrumbsPhrase" value="Searching" />

<input type="text" id="searchQuery" placeholder="Search files" />

<input type="hidden" id="interval" value="600000" />

<input type="hidden" id="archiveReadyDownload" value="Your file is ready for download:" />

<input type="hidden" id="defAvatar" value="http://static.4shared.com/images/user2.png?ver=2906097813" />

<input type="hidden" id="zipAvatar" value="http://static.4shared.com/icons/32x32/zip.png?ver=655479399" />

<input type="hidden" id="b1Avatar" value="http://static.4shared.com/icons/32x32/b1.png?ver=703417425" />

<input type="hidden" id="torrentAvatar" value="http://static.4shared.com/icons/32x32/torrent.png?ver=1628575404" />

<input type="hidden" id="contactRequestText" value="Your friend $[p1] just joined 4shared." />

<input type="button" value="Ok" onclick="checkAndStartDownload(event);" style="width:80px" />

<input type="button" value="Cancel" onclick="hideTermsOfUse();" />

<input type="hidden" id="startTitle" value="Share" />

<input type="hidden" id="sharingFolderTitle" value="Share folder" />

<input type="hidden" id="sharingFileTitle" value="Share file" />

<input type="hidden" id="placeHolderEnterEmailAdresses" value="Enter names or e-mail addresses" />

<input type="hidden" id="dLinkPay" value="Direct link is available only for Premium Users.&lt;br&gt; Sign Up to premium account to get all 4shared Premium Features." />

<input type="hidden" id="premiumRequired" value="Premium account required!" />

<input type="hidden" id="hosted" value="Hosted at" />

<input type="hidden" id="fbInviteFolderTitle" value="I've shared a folder with you on 4shared. Find out what it is!" />

<input type="hidden" id="fbInviteFileTitle" value="I've shared a file with you on 4shared. Find out what it is!" />

<input type="hidden" id="contacts" value="Contacts" />

<input type="hidden" id="fb_share_folder_img" value="http://static.4shared.com/images/facebook/share_folder.png?ver=2422162001" />

<input type="hidden" id="fb_share_file_img" value="http://static.4shared.com/images/facebook/share_file.png?ver=1565381062" />

<input type="hidden" id="fb_redir_param" value="https://www.4shared.com/servlet/signin/facebook?fp=https://www.4shared.com/account/home.jsp" />

<input type="hidden" id="fileSuccessfullSent" value="Your file was successfully sent" />

<input type="hidden" id="folderSuccessfullSent" value="Your folder was successfully sent" />

<input type="hidden" id="fbRequestSharedText" value="I'd like to share $[p0] with you" />

<input type="hidden" id="fbSharingOff" value="null" />

<input type="hidden" id="fbInviteText" value="4shared.com - free web-based file sharing and storage." />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input class="lucida dark-gray selectable" id="simpleViewLink" type="text" readonly="readonly" />

<input type="text" id="emails" class="lucida f12 dark-gray tags gaClick" data-element="shF-2-1" name="emails" tabindex="3" />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input type="text" id="downloadFileLink" class="lucida f12 selectable" name="" tabindex="3" />

<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="4" value="" id="premiumDirectLink" />

<input type="text" class="lucida f12 selectable" id="fileHTMLembed" name="" tabindex="3" />

<input type="text" id="fileForumEmbed" class="lucida f12 selectable" name="" tabindex="4" />

<input type="text" class="lucida f12 selectable" id="fileEmbed" tabindex="5" />

<input class="lucida f12 dark-gray selectable" id="searchFriendsInput" type="text" placeholder="Search by name or e-mail address" />

<input id="tags_2" type="text" class="tags" />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input type="text" class="lucida f12 ffshadow dark-gray" name="" tabindex="4" value="" id="subdomainInput" />

<input type="text" class="lucida f12 ffshadow dark-gray" name="" tabindex="3" value="" id="subdomainValue" readonly="true" />

<input type="hidden" id="allreadyPasswordProtectedMess" value="You can't set password for this folder, because the parent folder '$[1]' is password protected." />

<input type="hidden" id="passwordChangeConfirmTitle" value="Password Change" />

<input type="hidden" id="passwordChangeConfirmBody" value="Some child directory already password protected. &lt;br/&gt; Changing password of current directory will cause password overwrite on children's " />

<input type="hidden" id="confirmButtonMsg" value="Change" />

<input type="hidden" id="cancelButtonMsg" value="Cancel" />

<input type="text" class="passInput lucida f12" name="" tabindex="4" value="" id="passwordInput" />

<input type="password" class="passInput lucida f12" name="" tabindex="4" value="" id="changePasswordInput" readonly="true" />

<input type="hidden" id="previewLinkForEmbed" />

<input type="hidden" id="previewLinkForWidget" />

<input class="lucida f12 dark-gray" id="widget_width" type="text" style="width:30px;" />

<input class="lucida f12 dark-gray" id="widget_height" type="text" style="width:30px;" />

<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="3" id="htmlEmbed" />

<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="4" id="forumEmbed" />

<input type="text" value="http://www.4shared.com/android/i-EbooI0/batman_hd.html" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="16" dir="ltr" />

<input type="text" value="&lt;a href=&quot;http://www.4shared.com/android/i-EbooI0/batman_hd.html&quot; target=_blank&gt;batman hd.apk&lt;/a&gt;" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="17" dir="ltr" />

<input type="text" value="[URL=http://www.4shared.com/android/i-EbooI0/batman_hd.html]batman hd.apk[/URL]" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="18" dir="ltr" />

<input type="hidden" name="showComments" value="true" />

<input type="hidden" name="showPart" value="commentList" />

<input type="hidden" name="replyId" value="" />

<input type="hidden" id="norecaptcha" name="norecaptcha" value="" />

<input type="hidden" name="start" value="0" />

<input id="submitCommBtn" type="submit" value="Add New Comment" class="gaClick floatLeft f11 marginT10 round4 lucida no-line sendCommentButton" data-element="32" />

<input type="text" class="input-gray-big wide round4" id="recaptcha_response_field" name="recaptcha_response_field" style="width:250px" />

<input class="field2" id="submitCommBtn" type="submit" value="Confirm" />

<input type="text" name="fileName" value="4shared" class="xBox" />

<input type="hidden" name="newValue" value="" />

<input type="hidden" name="mode" value="" />

<input type="hidden" name="fid" value="3168935269" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="text" name="newValue" class="xBox" style="width:200px" />

<input type="hidden" name="mode" value="2" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12" onclick="quickEditCancel(1)" />

<input type="text" name="newValue" class="xBox" style="width:330px" onkeypress="return quickEditIsValidCharForFileName(event);" />

<input type="hidden" name="mode" value="10" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatRight marginL10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatRight" onclick="quickEditCancel();" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="did" value="0" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="text" name="searchName" style="width:250px;padding:1px 0" class="ajax-suggestion field gaClick" data-element="fs1" autocomplete="off" />

<input type="submit" name="submitButton" value="Search" class="button gaClick" data-element="fs3" />

<input type="hidden" name="searchmode" value="2" />

Using try.jsoup.com did not yield these inputs either, which suggests that the problem is not my code but Jsoup itself.

Other threads suggest that JavaScript may be changing the HTML after the page loads, but none of them offered a viable way to fix this.

What am I doing wrong and how do I fix it?

This is my code for getting the full html page:

Document doc = Jsoup.connect("http://www.4shared.com/get/i-EbooI0/batman_hd.html").timeout(0).get();
System.out.println(doc.toString() + "\n\n\n\n");
Elements links = doc.select("input[type=hidden]");
for (org.jsoup.nodes.Element link : links) {
    System.out.println(link);
}

View Screenshot of needed values here

SOLUTION

Connection.Response response = Jsoup.connect("myUrl")
    .method(Connection.Method.GET)
    .execute();

Document homePage = Jsoup.connect("myUrl")
    .cookies(response.cookies())
    .get();

This is a modified version of the code described in "Jsoup Cookies for HTTPS scraping". It first fetches the cookies, as suggested by Niranjan, and then reconnects to the URL with those cookies attached.

horvste
1 Answer

Jsoup cleans up the HTML while parsing, and it can handle HTML that is not well-formed. Try dumping the HTML after parsing, i.e. Document.html(), and check whether the elements you are missing appear in the dump and match your select clause.

UPDATE

Try this out; the explanation follows the code.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HiddenFields
{
    public static void main(String[] args)
    {
        try
        {
            // Cookies copied from a browser session on the site (see the explanation below).
            Map<String, String> cookieMap = new HashMap<String, String>();
            cookieMap.put("day1host", "h");
            cookieMap.put("d1.loginity.mark", "1");
            cookieMap.put("hostid", "-1314014314");
            cookieMap.put("__qca", "P0-2042580316-1371938383086");
            cookieMap.put("cd1v", "OOhB");
            cookieMap.put("c29", "1");
            cookieMap.put("__utma", "210074320.280144312.1371938377.1371938377.1371938377.1");
            cookieMap.put("__utmb", "210074320.4.10.1371938377");
            cookieMap.put("__utmc", "210074320");
            cookieMap.put("__utmz", "210074320.1371938377.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)");

            // Request the page with a browser user agent and the cookies attached.
            Document document = Jsoup.connect("http://www.4shared.com/get/i-EbooI0/batman_hd.html")
                .userAgent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36")
                .followRedirects(true)
                .cookies(cookieMap)
                .get();
            // System.out.println(document.html());

            // Print every hidden input found in the parsed page.
            Elements elements = document.select("input[type=hidden]");
            for (Element element : elements)
            {
                System.out.println(element);
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

EXPLANATION

I'm not sure whether the pattern below is the same for all the URLs you are trying.

This is how the site is responding.

  1. The site redirects from /get/i-EbooI0/batman_hd.html to android/i-EbooI0/batman_hd.html. During the redirection it sends out 2 cookies in response to the 1st request.

    1st request

  2. A few more cookies are sent on the 2nd request.

    2nd request

    There are no hidden fields in the <body> yet. Confirm this by looking at the Elements tab.

  3. Now request http://www.4shared.com/get/i-EbooI0/batman_hd.html in the browser.

    3rd Request

    Now the required hidden fields are present in the <body>.

I'm performing step 3 directly in the code.
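
The cookies the site sends in steps 1 and 2 arrive as Set-Cookie response headers, and the cookieMap in the code is simply their name/value pairs. As a rough sketch using only the JDK's java.net.HttpCookie, this is one way such headers could be turned into a map. The class name SetCookieParsing, the helper toCookieMap and the "Path=/" attributes are made up for illustration; only the cookie names and values come from the code above.

```java
import java.net.HttpCookie;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SetCookieParsing {

    // Hypothetical helper: collects the name/value pair of every cookie
    // found in the given Set-Cookie header lines, in the order received.
    static Map<String, String> toCookieMap(List<String> setCookieHeaders) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String header : setCookieHeaders) {
            // HttpCookie.parse accepts a header with or without the "Set-Cookie:" prefix.
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                cookies.put(cookie.getName(), cookie.getValue());
            }
        }
        return cookies;
    }

    public static void main(String[] args) {
        // Illustrative headers; the attributes after the first ';' are assumed.
        List<String> headers = List.of(
            "Set-Cookie: day1host=h; Path=/",
            "Set-Cookie: hostid=-1314014314; Path=/");
        System.out.println(toCookieMap(headers));
    }
}
```

Attributes such as Path are dropped on purpose here, since a map of names to values is all that Jsoup's cookies(Map) expects.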


CONCLUSION

If you observe the same behavior for other URLs as well, you will have to write code that captures the cookies of each response and passes them along with the subsequent request, until you get the desired hidden fields.
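
The loop described above could look roughly like the sketch below. It is not tied to Jsoup: the request is passed in as a function, so the logic can be tried without a network. CookieRetry, Reply, fetchUntilPresent and the fake server in main are hypothetical names invented for this sketch. With Jsoup, the function would presumably wrap Jsoup.connect(url).cookies(sent).execute() and return the response's cookies() and body().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

class CookieRetry {

    // Hypothetical stand-in for an HTTP response: the cookies it set and its body.
    static final class Reply {
        final Map<String, String> cookies;
        final String body;

        Reply(Map<String, String> cookies, String body) {
            this.cookies = cookies;
            this.body = body;
        }
    }

    // Repeats the request, sending every cookie collected so far, until the
    // body contains the marker or maxAttempts requests have been made.
    static Optional<String> fetchUntilPresent(
            Function<Map<String, String>, Reply> fetch, String marker, int maxAttempts) {
        Map<String, String> jar = new HashMap<>();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Reply reply = fetch.apply(new HashMap<>(jar)); // send a copy of the jar
            jar.putAll(reply.cookies);                     // remember any new cookies
            if (reply.body.contains(marker)) {
                return Optional.of(reply.body);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Fake server (an assumption for the demo): it only returns the hidden
        // field once the "hostid" cookie is sent back.
        Function<Map<String, String>, Reply> fake = sent -> sent.containsKey("hostid")
            ? new Reply(new HashMap<>(), "<input type=\"hidden\" id=\"fileId\" value=\"3168935269\">")
            : new Reply(Map.of("hostid", "-1314014314"), "<html></html>");

        System.out.println(fetchUntilPresent(fake, "id=\"fileId\"", 3).orElse("not found"));
    }
}
```

Capping the number of attempts keeps the loop from running forever on pages where the hidden field never shows up, for instance when it is added by JavaScript rather than by the server.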

Niranjan
  • I need to get every "hidden" value in this document: view-source:http://www.4shared.com/get/i-EbooI0/batman_hd.html. This is Chrome's source of the webpage. Jsoup does not give me all the hidden values. What am I missing? I previously tried doc.html() and it did not help – horvste Jun 22 '13 at 18:14
  • @user2489210 It's fetching 41 hidden fields for me in both cases: 1. reading from the URL directly, and 2. saving the browser (Chrome) content to an HTML file and reading from that file. How do you know that the hidden fields stated in the question are missing? – Niranjan Jun 22 '13 at 18:43
  • Because the specific hidden field I am looking for is omitted from the HTML fetched by Jsoup. I would like to get all the hidden fields but am specifically looking for this one: This is NOT listed in my Jsoup code – horvste Jun 22 '13 at 18:53
  • @user2489210 That is not listed in the HTML code when viewed in Chrome either – Niranjan Jun 22 '13 at 19:19
  • When using Inspect Element it is. Will post a screenshot – horvste Jun 22 '13 at 19:21
  • How would one extract the – horvste Jun 22 '13 at 21:35
  • @user2489210 I edited the answer to reflect the code you are looking for. Please update me the results after running the code. – Niranjan Jun 22 '13 at 22:22
  • This code works for one URL. I tried it for a few cases and it has only worked for that one specific URL. For the other cases Chrome was still able to pick up the baseDownload hidden – horvste Jun 22 '13 at 22:40
  • I cannot find the android/i-EbooI0/batman_hd.html. Does it matter if I don't find it? It seems /get/i-EbooI0/batman_hd.html contains everything that is in android/i-EbooI0/batman_hd.html. – horvste Jun 23 '13 at 00:04
  • @user2489210 Clear your browser cookies before you try – Niranjan Jun 23 '13 at 00:13