I am having an issue with jsoup: it throws a malformed URL error. If I hardcode the URL into the program it works fine, but if I read a CSV file into a List<String[]> and then loop over the values in the list, it fails. For example, hardcoding http://www.clubmark.org.uk/ works, but reading the same URL from the CSV into the List<String[]> fails.

The stack trace is

Exception in thread "restartedMain" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.boot.devtools.restart.RestartLauncher.run(RestartLauncher.java:49)
Caused by: java.lang.IllegalArgumentException: Malformed URL: http://www.clubmark.org.uk/
    at org.jsoup.helper.HttpConnection.url(HttpConnection.java:131)
    at org.jsoup.helper.HttpConnection.connect(HttpConnection.java:70)
    at org.jsoup.Jsoup.connect(Jsoup.java:73)
    at com.domainModel.DownloadImages.findImages(DownloadImages.java:43)
    at com.workingprojects.WebScraperApplication.main(WebScraperApplication.java:40)

My main class is

@SpringBootApplication
@EntityScan({"com.bootstrap","com.domainModel"})
@ComponentScan({"com.bootstrap","com.domainModel"})
public class WebScraperApplication {

    public static void main(String[] args) throws IOException, CsvException {
        SpringApplication.run(WebScraperApplication.class, args);

        DownloadImages downloadImages = new DownloadImages();

        ReadCSV readCSV = new ReadCSV();
        ArrayList<String[]> urls = (ArrayList<String[]>) readCSV.csvReader("C:\\link1.csv");

        // The loop bound is 1, so only the first row of the CSV is processed.
        for (int i = 0; i < 1; i++) {
            String[] thisURLObject = urls.get(i);
            String thisURL = thisURLObject[0];
            String status = downloadImages.findImages(thisURL, "C:\\Users\\xxx\\images");
            System.out.println(thisURL + status);
        }

        System.out.println("finished");
    }
}

The class that downloads the images, and where the issue occurs, is

package com.domainModel;

import org.jsoup.Jsoup;

public class DownloadImages {

    // The URL of the website.
    @Getter @Setter
    private String webSiteURL;

    // The path of the folder that the images are saved to.
    @Getter @Setter
    private String folderPath;

    public String findImages(String webSiteURL, String folderPath) {

        try {
            // Connect to the website and get the HTML.
            Document doc = Jsoup.connect(webSiteURL).get();

            // Get all elements with an img tag.
            Elements img = doc.getElementsByTag("img");
            System.out.println("Images is" + img.size());

            // Derive the target folder name from the URL.
            String folderNameWk2 = webSiteURL.replace(".html", "");
            String folderNameWk3 = folderNameWk2.replace("http://", "");

            Path path = Paths.get(folderPath + folderNameWk3);
            Files.createDirectories(path);
            String path1 = path.toString();
            System.out.println("The path is " + path1);

            int counter = 0;

            for (Element el : img) {
                String docName = String.valueOf(counter) + ".jpeg";

                // For each element, get the absolute src URL.
                String src = el.absUrl("src");

                System.out.println("Image Found!");
                System.out.println("src attribute is : " + src);
                getImages(src, path1, docName);

                counter = counter + 1;
            }

        } catch (IOException ex) {
            System.err.println("There was an error");
            System.out.println(ex);
            // Logger.getLogger(DownloadImages.class.getName()).log(Level.SEVERE, null, ex);
        }

        return "complete";
    }

    private void getImages(String src, String folderPath, String docName) throws IOException {

        // Extract the name of the image from the src attribute.
        int indexname = src.lastIndexOf("/");

        if (indexname == src.length()) {
            src = src.substring(1, indexname);
        }

        indexname = src.lastIndexOf("/");
        String name = src.substring(indexname, src.length());

        System.out.println(name);

        // Open a URL stream and copy it to the target file byte by byte.
        URL url = new URL(src);
        InputStream in = url.openStream();

        OutputStream out = new BufferedOutputStream(new FileOutputStream(folderPath + "/" + docName));

        for (int b; (b = in.read()) != -1;) {
            out.write(b);
        }
        out.close();
        in.close();
    }

    /**
     * @param webSiteURL
     * @param folderPath
     */
    public DownloadImages(String webSiteURL, String folderPath) {
        super();
        this.webSiteURL = webSiteURL;
        this.folderPath = folderPath;
    }

    public DownloadImages() {
        super();
    }
}


And the class that reads the CSV file is

package com.domainModel;

public class ReadCSV {

    public List<String[]> csvReader(String fileName) throws IOException, CsvException {
        try (CSVReader reader = new CSVReader(new FileReader(fileName))) {
            List<String[]> r = reader.readAll();
            return r;
        }
    }
}

I am reasonably certain the issue is with the format of what I am passing from the list, but when I inspect the values they certainly appear to be Strings.
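A value can be a perfectly valid String and still carry characters that do not show up when printed. One way to check is to dump each character as a code point and compare the result with the hardcoded literal. The sketch below is only an illustration: the class name `HiddenCharDump`, the helper `describe`, and the sample value with a leading `\uFEFF` are assumptions, not part of the original program.

```java
public class HiddenCharDump {

    // Renders every char of the string as U+XXXX so that invisible
    // characters (for example a byte order mark) become visible.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the first CSV cell; the leading \uFEFF
        // is an assumption used to demonstrate the check.
        String fromCsv = "\uFEFFhttp://a/";
        String hardCoded = "http://a/";

        System.out.println(fromCsv.length() + " chars: " + describe(fromCsv));
        System.out.println(hardCoded.length() + " chars: " + describe(hardCoded));
        System.out.println("equal? " + fromCsv.equals(hardCoded));
    }
}
```

In the program above, the same helper could be applied to `thisURLObject[0]` just before the call to `findImages`; a length that is larger than expected, or a first code point that is not `U+0068` (the letter h), would point to hidden characters in the file.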

First two rows of csv file

http://www.clubmark.org.uk/,
http://www.designit-uk.com/,

Image of the first two rows of data in notepad

  • 2
    That's a lot of code. Any chance you could [edit] the question and bring the code down to a [mcve]? – Robert Jan 05 '21 at 16:56
  • 1
    Some CSV data would also be needed, to ensure you really do have a [mre]. – andrewJames Jan 05 '21 at 16:58
  • thanks for your help. I have removed imports from the code and some of the comments. Added image of csv file and first 2 rows of data – HuddersfieldLad Jan 05 '21 at 18:37
  • 1
    Thank you for the updates. Your code works fine for me - it does not throw a Malformed URL exception. JSoup loads the web page for the clubmark URL correctly. So, this means I still do not have the "reproducible" part of a [mre]. – andrewJames Jan 05 '21 at 19:20
  • 1
    As an aside, I notice you are using Notepad with your text file. I recommend you don't use Notepad because it inserts hidden [byte order marks](https://en.wikipedia.org/wiki/Byte_order_mark) at the start of the file. I don't think that is the cause of your specific problem, but I would still recommend Notepad++ or another text editor instead. – andrewJames Jan 05 '21 at 19:20
  • @andrewjames I think the issue only occurs with data read in from a CSV file. It works fine when I just hardcode a URL. I think the only way to recreate it is to load a CSV file into the program. I am not sure how to upload a copy of the CSV file onto this site, but it is a simple CSV file with 1 column containing multiple URLs. Note that I created the CSV using Excel (I doubt it is relevant, but it may be). Thanks for the advice on Notepad – HuddersfieldLad Jan 05 '21 at 19:32
  • Ah - as a test, try creating a plain text file using Notepad++ (which is what I did, and your code worked). Don't use Excel to create the CSV file. See if that works. Excel (like Notepad and other MS tools) also tends to add BOMs to the start of the file, so maybe that is interfering with the URL parser. – andrewJames Jan 05 '21 at 19:42
  • Then, if that helps, take a look at [this](https://stackoverflow.com/questions/21891578/removing-bom-characters-using-java), or see if OpenCSV can be configured to handle (i.e. ignore) BOMs. – andrewJames Jan 05 '21 at 19:45
  • Thanks @andrewjames, it was the BOM character. For anyone who hits this problem again: open the CSV in Notepad++ and look at the bottom right corner. If it says UTF-8-BOM, the file contains a BOM character. To remove it, go to Encoding and select Convert to UTF-8, then save the file and retry the import. – HuddersfieldLad Jan 05 '21 at 20:06
  • Glad it's resolved. You are welcome to convert the notes in your last comment into an answer. You can even mark the answer as accepted, if you wish. – andrewJames Jan 05 '21 at 21:25
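
Following the comments above, the BOM can also be removed in code instead of by re-saving the file. The sketch below is a minimal illustration under stated assumptions, not a definitive fix: the class `BomStrip` and the helpers `clean` and `parses` are hypothetical names, and `java.net.URL` is used only to show that a leading U+FEFF prevents the string from being parsed as a URL.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class BomStrip {

    // U+FEFF is how a UTF-8 byte order mark appears once the file is decoded.
    private static final char BOM = '\uFEFF';

    // Returns the value without a leading BOM and without surrounding whitespace.
    static String clean(String raw) {
        String s = raw;
        if (!s.isEmpty() && s.charAt(0) == BOM) {
            s = s.substring(1);
        }
        return s.trim();
    }

    // Reports whether java.net.URL accepts the string.
    static boolean parses(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed shape of the first CSV cell when the file starts with a BOM.
        String fromCsv = BOM + "http://www.clubmark.org.uk/";

        System.out.println("raw parses:     " + parses(fromCsv));
        System.out.println("cleaned parses: " + parses(clean(fromCsv)));
    }
}
```

In the question's code, something like `clean(thisURLObject[0])` could be passed to `findImages`. Only the first cell of the first row is affected, since the BOM sits at the very start of the file.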
