I am trying to scrape webpages with Jsoup, but Jsoup doesn't seem to capture all of the <input> elements that Chrome shows.
It is missing values such as these:
<input type="hidden" id="fileId" value="3168935269">
<input type="hidden" id="secondsLeft" value="20">
Using Jsoup I extracted these elements:
<input type="hidden" class="jsItemDirId" value="yRg1N-QP" />
<input type="hidden" class="jsItemFileId" value="i-EbooI0" />
<input type="hidden" id="fbAppId" value="255519317820035" />
<input type="hidden" id="sPrefix" value="http://search.4shared.com" />
<input type="hidden" class="sLink file" value="/q/CCAD/1" />
<input type="hidden" class="sLink video" value="/q/CCQD/1/video" />
<input type="hidden" class="sLink music" value="/q/CCQD/1/music" />
<input type="hidden" class="sLink photo" value="/q/CCQD/1/photo" />
<input type="hidden" class="sLink games" value="/q/CCQD/1/game" />
<input type="hidden" class="sLink book" value="/q/CCQD/1/books_office" />
<input type="hidden" class="sLink featured_videos" value="/q/CCQD/1/video" />
<input type="hidden" id="sBreadcrumbsPhrase" value="Searching" />
<input type="text" id="searchQuery" placeholder="Search files" />
<input type="hidden" id="interval" value="600000" />
<input type="hidden" id="archiveReadyDownload" value="Your file is ready for download:" />
<input type="hidden" id="defAvatar" value="http://static.4shared.com/images/user2.png?ver=2906097813" />
<input type="hidden" id="zipAvatar" value="http://static.4shared.com/icons/32x32/zip.png?ver=655479399" />
<input type="hidden" id="b1Avatar" value="http://static.4shared.com/icons/32x32/b1.png?ver=703417425" />
<input type="hidden" id="torrentAvatar" value="http://static.4shared.com/icons/32x32/torrent.png?ver=1628575404" />
<input type="hidden" id="contactRequestText" value="Your friend $[p1] just joined 4shared." />
<input type="button" value="Ok" onclick="checkAndStartDownload(event);" style="width:80px" />
<input type="button" value="Cancel" onclick="hideTermsOfUse();" />
<input type="hidden" id="startTitle" value="Share" />
<input type="hidden" id="sharingFolderTitle" value="Share folder" />
<input type="hidden" id="sharingFileTitle" value="Share file" />
<input type="hidden" id="placeHolderEnterEmailAdresses" value="Enter names or e-mail addresses" />
<input type="hidden" id="dLinkPay" value="Direct link is available only for Premium Users.<br> Sign Up to premium account to get all 4shared Premium Features." />
<input type="hidden" id="premiumRequired" value="Premium account required!" />
<input type="hidden" id="hosted" value="Hosted at" />
<input type="hidden" id="fbInviteFolderTitle" value="I've shared a folder with you on 4shared. Find out what it is!" />
<input type="hidden" id="fbInviteFileTitle" value="I've shared a file with you on 4shared. Find out what it is!" />
<input type="hidden" id="contacts" value="Contacts" />
<input type="hidden" id="fb_share_folder_img" value="http://static.4shared.com/images/facebook/share_folder.png?ver=2422162001" />
<input type="hidden" id="fb_share_file_img" value="http://static.4shared.com/images/facebook/share_file.png?ver=1565381062" />
<input type="hidden" id="fb_redir_param" value="https://www.4shared.com/servlet/signin/facebook?fp=https://www.4shared.com/account/home.jsp" />
<input type="hidden" id="fileSuccessfullSent" value="Your file was successfully sent" />
<input type="hidden" id="folderSuccessfullSent" value="Your folder was successfully sent" />
<input type="hidden" id="fbRequestSharedText" value="I'd like to share $[p0] with you" />
<input type="hidden" id="fbSharingOff" value="null" />
<input type="hidden" id="fbInviteText" value="4shared.com - free web-based file sharing and storage." />
<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />
<input type="radio" class="writeFlag" name="permissions" value="write" />
<input class="lucida dark-gray selectable" id="simpleViewLink" type="text" readonly="readonly" />
<input type="text" id="emails" class="lucida f12 dark-gray tags gaClick" data-element="shF-2-1" name="emails" tabindex="3" />
<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />
<input type="radio" class="writeFlag" name="permissions" value="write" />
<input type="text" id="downloadFileLink" class="lucida f12 selectable" name="" tabindex="3" />
<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="4" value="" id="premiumDirectLink" />
<input type="text" class="lucida f12 selectable" id="fileHTMLembed" name="" tabindex="3" />
<input type="text" id="fileForumEmbed" class="lucida f12 selectable" name="" tabindex="4" />
<input type="text" class="lucida f12 selectable" id="fileEmbed" tabindex="5" />
<input class="lucida f12 dark-gray selectable" id="searchFriendsInput" type="text" placeholder="Search by name or e-mail address" />
<input id="tags_2" type="text" class="tags" />
<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />
<input type="radio" class="writeFlag" name="permissions" value="write" />
<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />
<input type="radio" class="writeFlag" name="permissions" value="write" />
<input type="text" class="lucida f12 ffshadow dark-gray" name="" tabindex="4" value="" id="subdomainInput" />
<input type="text" class="lucida f12 ffshadow dark-gray" name="" tabindex="3" value="" id="subdomainValue" readonly="true" />
<input type="hidden" id="allreadyPasswordProtectedMess" value="You can't set password for this folder, because the parent folder '$[1]' is password protected." />
<input type="hidden" id="passwordChangeConfirmTitle" value="Password Change" />
<input type="hidden" id="passwordChangeConfirmBody" value="Some child directory already password protected. <br/> Changing password of current directory will cause password overwrite on children's " />
<input type="hidden" id="confirmButtonMsg" value="Change" />
<input type="hidden" id="cancelButtonMsg" value="Cancel" />
<input type="text" class="passInput lucida f12" name="" tabindex="4" value="" id="passwordInput" />
<input type="password" class="passInput lucida f12" name="" tabindex="4" value="" id="changePasswordInput" readonly="true" />
<input type="hidden" id="previewLinkForEmbed" />
<input type="hidden" id="previewLinkForWidget" />
<input class="lucida f12 dark-gray" id="widget_width" type="text" style="width:30px;" />
<input class="lucida f12 dark-gray" id="widget_height" type="text" style="width:30px;" />
<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="3" id="htmlEmbed" />
<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="4" id="forumEmbed" />
<input type="text" value="http://www.4shared.com/android/i-EbooI0/batman_hd.html" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="16" dir="ltr" />
<input type="text" value="<a href="http://www.4shared.com/android/i-EbooI0/batman_hd.html" target=_blank>batman hd.apk</a>" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="17" dir="ltr" />
<input type="text" value="[URL=http://www.4shared.com/android/i-EbooI0/batman_hd.html]batman hd.apk[/URL]" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="18" dir="ltr" />
<input type="hidden" name="showComments" value="true" />
<input type="hidden" name="showPart" value="commentList" />
<input type="hidden" name="replyId" value="" />
<input type="hidden" id="norecaptcha" name="norecaptcha" value="" />
<input type="hidden" name="start" value="0" />
<input id="submitCommBtn" type="submit" value="Add New Comment" class="gaClick floatLeft f11 marginT10 round4 lucida no-line sendCommentButton" data-element="32" />
<input type="text" class="input-gray-big wide round4" id="recaptcha_response_field" name="recaptcha_response_field" style="width:250px" />
<input class="field2" id="submitCommBtn" type="submit" value="Confirm" />
<input type="text" name="fileName" value="4shared" class="xBox" />
<input type="hidden" name="newValue" value="" />
<input type="hidden" name="mode" value="" />
<input type="hidden" name="fid" value="3168935269" />
<input type="hidden" name="mode" value="3" />
<input type="hidden" name="fid" value="3168935269" />
<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />
<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />
<input type="hidden" name="mode" value="3" />
<input type="hidden" name="fid" value="3168935269" />
<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />
<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />
<input type="hidden" name="mode" value="3" />
<input type="hidden" name="fid" value="3168935269" />
<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />
<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />
<input type="text" name="newValue" class="xBox" style="width:200px" />
<input type="hidden" name="mode" value="2" />
<input type="hidden" name="fid" value="3168935269" />
<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />
<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12" onclick="quickEditCancel(1)" />
<input type="text" name="newValue" class="xBox" style="width:330px" onkeypress="return quickEditIsValidCharForFileName(event);" />
<input type="hidden" name="mode" value="10" />
<input type="hidden" name="fid" value="3168935269" />
<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatRight marginL10" />
<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatRight" onclick="quickEditCancel();" />
<input type="hidden" name="mode" value="3" />
<input type="hidden" name="did" value="0" />
<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />
<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />
<input type="text" name="searchName" style="width:250px;padding:1px 0" class="ajax-suggestion field gaClick" data-element="fs1" autocomplete="off" />
<input type="submit" name="submitButton" value="Search" class="button gaClick" data-element="fs3" />
<input type="hidden" name="searchmode" value="2" />
try.jsoup.com did not return these inputs either, which suggests that the problem is not my code but the HTML that Jsoup receives.
Other threads suggest that JavaScript may be changing the HTML after the page loads, but none of them offered a viable fix.
What am I doing wrong and how do I fix it?
This is my code for getting the full html page:
Document doc = Jsoup.connect("http://www.4shared.com/get/i-EbooI0/batman_hd.html").timeout(0).get();
System.out.println(doc.toString() + "\n\n\n\n");
Elements links = doc.select("input[type=hidden]");
for (org.jsoup.nodes.Element link : links) {
    System.out.println(link);
}
View Screenshot of needed values here
SOLUTION
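// First request: made only to collect the cookies the server sets.
Connection.Response response = Jsoup.connect("myUrl")
        .method(Connection.Method.GET)
        .execute();
// Second request: send those cookies back to get the full page.
Document homePage = Jsoup.connect("myUrl")
        .cookies(response.cookies())
        .get();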
This is a modified version of the code described in "Jsoup Cookies for HTTPS scraping". It first fetches the cookies, as suggested by Niranjan, and then reconnects to the same URL with those cookies attached.
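To see why the second request can return different HTML, here is a minimal sketch that reproduces the same two-step pattern using only the JDK (no Jsoup), so it can be run without any dependencies. Everything in it is an assumption made for illustration: the local test server, the cookie name `session=abc`, and the class and method names are hypothetical, and I do not know how 4shared actually decides what to render. The server only includes the hidden `fileId` input when the request carries the cookie it handed out earlier.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class CookieGatedPageDemo {

    // Hypothetical local server: renders the hidden input only when the
    // request carries the session cookie it set on an earlier response.
    static HttpServer startServer() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String cookie = exchange.getRequestHeaders().getFirst("Cookie");
            boolean hasSession = cookie != null && cookie.contains("session=abc");
            String body = hasSession
                    ? "<input type=\"hidden\" id=\"fileId\" value=\"3168935269\">"
                    : "<p>no session yet</p>";
            if (!hasSession) {
                exchange.getResponseHeaders().add("Set-Cookie", "session=abc; Path=/");
            }
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, bytes.length);
            exchange.getResponseBody().write(bytes);
            exchange.close();
        });
        server.start();
        return server;
    }

    // Same idea as the Jsoup solution: request once to collect cookies,
    // then request again with those cookies attached.
    static String[] fetchWithCookies(URI uri) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();

        HttpResponse<String> first = client.send(
                HttpRequest.newBuilder(uri).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // Keep only the name=value part of each Set-Cookie header.
        String cookieHeader = first.headers().allValues("Set-Cookie").stream()
                .map(value -> value.split(";", 2)[0])
                .collect(Collectors.joining("; "));

        HttpResponse<String> second = client.send(
                HttpRequest.newBuilder(uri).header("Cookie", cookieHeader).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        return new String[] {first.body(), second.body()};
    }

    // Starts the server, fetches the page twice, and returns both bodies.
    static String[] runDemo() throws Exception {
        HttpServer server = startServer();
        try {
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            return fetchWithCookies(uri);
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        String[] pages = runDemo();
        System.out.println("without cookies: " + pages[0]);
        System.out.println("with cookies:    " + pages[1]);
    }
}
```

In this sketch the first body lacks the `fileId` input and the second contains it, which matches what I saw with Jsoup: the parser was fine, the server was simply sending a different page to a client without cookies.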