
I am struggling to get the absolute paths of the images that I am scraping from my website. I have looked at the documentation on jsoup.org, but I cannot get abs:src to work: I don't know how to use it or where to add it.

<cfhttp method="get" url="https://theculturecook.com/recipe-slowroasted-pork-belly.html" result="theresult">        
<cfscript>
    Jsoup = createObject("java", "org.jsoup.Jsoup");
    html = "#theresult.filecontent#";
    doc = Jsoup.parse(html);
    tags = doc.select("img[src$=.jpg]");
</cfscript>
<cfset images = "">
<cfloop index="e" array="#tags#">
    <cfoutput>
       <cfset images = ListAppend(images,#e.attr("src")#)>
    </cfoutput>
</cfloop>
<cfloop list="#images#" index="a">
    <cfoutput>#a#<br></cfoutput>
</cfloop>

1 Answer


The problem is that you are passing a raw HTML string to jsoup, so it has no base URL against which to resolve relative paths. If you need absolute paths, let jsoup fetch the page itself so that it knows where the document came from:

Jsoup.connect("https://theculturecook.com/recipe-slowroasted-pork-belly.html").get();

So finally,

<cfscript>
    Jsoup = createObject("java", "org.jsoup.Jsoup");
    doc = Jsoup.connect("https://theculturecook.com/recipe-slowroasted-pork-belly.html").get();
    tags = doc.select("img[src$=.jpg]");
</cfscript>
<cfset images = "">
<cfloop index="e" array="#tags#">
    <!--- abs:src resolves the src attribute against the document's base URI --->
    <cfset images = ListAppend(images, e.attr("abs:src"))>
</cfloop>
<cfloop list="#images#" index="a">
    <cfoutput>#a#<br></cfoutput>
</cfloop>
rrk
    Good point. Jsoup needs a base URL in order to resolve relative paths within the content. Using Jsoup to fetch the URL directly provides that context. It could also be done manually using the overloaded method: `Jsoup.parse(html, baseURL)` – SOS May 09 '20 at 21:28
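
For reference, the two-argument `parse` mentioned in the comment could be used roughly as follows. This is only a sketch: it keeps the `cfhttp` call from the question and assumes `theresult.fileContent` comes back as text. The variable name `pageUrl` is made up for illustration and is not part of the original code.

```cfml
<!--- pageUrl is a hypothetical variable; it serves as both the fetch target and the base URI --->
<cfset pageUrl = "https://theculturecook.com/recipe-slowroasted-pork-belly.html">
<cfhttp method="get" url="#pageUrl#" result="theresult">
<cfscript>
    Jsoup = createObject("java", "org.jsoup.Jsoup");
    // Passing the page URL as the second argument gives jsoup a base URI,
    // so abs:src can resolve relative image paths against it
    doc = Jsoup.parse(theresult.fileContent, pageUrl);
    tags = doc.select("img[src$=.jpg]");
</cfscript>
<cfset images = "">
<cfloop index="e" array="#tags#">
    <cfset images = ListAppend(images, e.attr("abs:src"))>
</cfloop>
<cfloop list="#images#" index="a">
    <cfoutput>#a#<br></cfoutput>
</cfloop>
```

Without a base URI, `abs:src` has nothing to resolve a relative path against and returns an empty string, which is likely why the original attempt appeared not to work.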