
I have a process in Talend which gets the search result of a page, saves the html and writes it into files, as seen here:

[Image: screenshot of the Talend job]

Initially I had a two-step process, with the data parsed out of the HTML files in Java. The code below works and writes the results to a MySQL database. (I'm a beginner, sorry for the lack of elegance.)

package org.jsoup.examples;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class parse2 {       
    static parse2 parseIt2 = new parse2();
    String companyName = "Platzhalter";
    String jobTitle = "Platzhalter";
    String location = "Platzhalter";
    String timeAdded = "Platzhalter";

    public static void main(String[] args) throws IOException {
        parseIt2.getData();
    }

    // Parse the saved search-result page and print one line per job listing
    public void getData() throws IOException {
        Document document = Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
        Elements elements = document.select(".joblisting");
        for (Element element : elements) {
            // Parse Data into Elements
            Elements jobTitleElement = element.select(".job_title span");
            Elements companyNameElement = element.select(".company_name span[itemprop=name]");
            Elements locationElement = element.select(".locality span[itemprop=addressLocality]");
            Elements dateElement = element.select(".job_date_added [datetime]");

            // Strip Data from unnecessary tags
            String companyName = companyNameElement.text();
            String jobTitle = jobTitleElement.text();
            String location = locationElement.text();
            String timeAdded = dateElement.attr("datetime");

            System.out.println("Firma:\t"+ companyName + "\t" + jobTitle + "\t in:\t" + location + " \t Erstellt am \t" + timeAdded );
        }  
    }
}

Now I want to run the whole process end-to-end in Talend, and I have been assured that this works. I tried this (which looks quite shady to me):

[Image: screenshot of the attempt in Talend]

Basically I put all imports in "advanced settings" and the code in the "basic settings" section. The importLibrary is meant to load the jsoup parsing library as well as the MySQL connector (I might do the connection with Talend tools, though).

Obviously this isn't working. I tried stripping the class definitions and similar scaffolding from the base code, and that was even worse. Can you help me get the generated .txt files parsed with Java here?
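Since the job writes several .txt files, one step that any Java-based version needs is finding them. The following is only a minimal sketch of that step in plain Java, not part of the original job: the class name `OutputFileLister`, the method `listResultFiles`, and the glob `keywords_*.txt` (guessed from the single file name `keywords_SOA.txt` above) are all assumptions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: collects the result files written by the crawler step.
public class OutputFileLister {

    // Returns all regular files in outputDir whose names match the glob, sorted by path.
    public static List<Path> listResultFiles(Path outputDir, String glob) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, glob)) {
            for (Path candidate : stream) {
                if (Files.isRegularFile(candidate)) {
                    files.add(candidate);
                }
            }
        }
        Collections.sort(files);
        return files;
    }

    public static void main(String[] args) throws IOException {
        // The directory is taken from the first argument; "." is only a fallback.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // "keywords_*.txt" is an assumed naming pattern, not confirmed by the question.
        for (Path file : listResultFiles(dir, "keywords_*.txt")) {
            System.out.println(file);
        }
    }
}
```

Each returned path could then be handed to `Jsoup.parse(path.toFile(), "utf-8")` in the same way as the single hard-coded file in `getData()`.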

EDIT: Here is the Link to the talend Job http://www.share-online.biz/dl/8M5MD99NR1

EDIT2: I changed the code to the one I tried in tJavaFlex, but it didn't work (the import part in the start part of the code, the rest in "body/main" and nothing in "end").

miken32
ZedBrannigan
  • Check tJavaFlex instead of tJavaRow, it will help. – Balazs Gunics Jul 24 '14 at 16:08
  • Can you show your errors as well please? If you could host the zip of your job and all its dependencies then that would make debugging it even easier. Also, I'd probably be inclined to tear that DB code out and use a proper Talend DB connector instead. – ydaetskcoR Jul 24 '14 at 16:18
  • No problem with the elegance, but next time press Ctrl+Shift+F while in Eclipse in order to format the code :) – Alkis Kalogeris Jul 24 '14 at 17:43
  • Would you like to share your input file? I want to test it at my end to give you a better solution. – UmeshR Jul 25 '14 at 06:50
  • Yes, thank you all for your help. I will share everything. The input file I parse is generated in the Talend workflow. The link is at the bottom of my question. I'm open to any suggestion on how to solve this problem in one workflow! I don't feel I'm doing it the easy way :) – ZedBrannigan Jul 25 '14 at 08:16
  • EDIT2: I changed the code to the one I tried in tJavaFlex, but it didn't work (the import part in the start part of the code, the rest in "body/main" and nothing in "end"). – ZedBrannigan Jul 25 '14 at 13:29
  • Is there a way to "bump" this, although it's impolite? – ZedBrannigan Jul 28 '14 at 07:13
  • If you edit your question and add in more details, that can help it get more attention. – SteveDonie Dec 14 '15 at 20:35
  • You need to write your custom Java code (with all your import statements) in a **routine** and then use functions defined in that routine either in tMap components directly (if possible) or use that in tJavaFlex components. For more details, see this https://help.talend.com/display/KB/Creating+a+user+routine+and+call+it+in+a+Job – Incognito Jan 24 '16 at 04:22
  • Have you tried debugging or logging to see if the file was imported successfully by `Jsoup.parse`? – Khaled.K Jan 24 '16 at 09:13
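
The routine suggestion in the comments could look roughly like the sketch below. It is an illustration only: the class name `JobListingRoutine` and its methods are hypothetical, the selectors are copied from the question's code, and it assumes the jsoup jar is available to the job. In Talend a routine would normally also carry a `package routines;` line, which is left out here so the class compiles on its own.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical user routine: all jsoup imports live here instead of in a component.
public class JobListingRoutine {

    // Parses HTML that is already in memory.
    // Each row is {companyName, jobTitle, location, timeAdded}.
    public static List<String[]> parseListings(String html) {
        return extract(Jsoup.parse(html));
    }

    // Parses one of the files written by the job, read as UTF-8.
    public static List<String[]> parseListings(File file) throws IOException {
        return extract(Jsoup.parse(file, "utf-8"));
    }

    // Applies the same selectors as getData() in the question.
    private static List<String[]> extract(Document document) {
        List<String[]> rows = new ArrayList<>();
        for (Element listing : document.select(".joblisting")) {
            rows.add(new String[] {
                listing.select(".company_name span[itemprop=name]").text(),
                listing.select(".job_title span").text(),
                listing.select(".locality span[itemprop=addressLocality]").text(),
                listing.select(".job_date_added [datetime]").attr("datetime")
            });
        }
        return rows;
    }
}
```

A component would then only need a single call such as `JobListingRoutine.parseListings(new java.io.File(path))`, with `path` being whatever variable holds the current file name, and no import statements of its own.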

1 Answer


This is a Talend-specific problem. In your code, use fully qualified class names, that is, names that include their packages. For your document parsing, for example, you can use:

org.jsoup.nodes.Document document = org.jsoup.Jsoup.parse(new java.io.File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
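
Applied to the whole loop, the same idea might look like the sketch below. It is an illustration under assumptions, not a tested Talend job: the wrapper class `QualifiedNamesSketch`, the method `describe`, and the sample HTML are hypothetical and exist only so the snippet compiles and runs outside Talend. Only the body of `describe` corresponds to what would be pasted into a component, and it assumes the jsoup jar is on the job's classpath.

```java
// Hypothetical wrapper class; note that the file has no import statements at all.
public class QualifiedNamesSketch {

    // Every non-java.lang type is written with its package name.
    public static java.util.List<String> describe(String html) {
        java.util.List<String> lines = new java.util.ArrayList<String>();
        org.jsoup.nodes.Document document = org.jsoup.Jsoup.parse(html);
        org.jsoup.select.Elements listings = document.select(".joblisting");
        for (org.jsoup.nodes.Element listing : listings) {
            String jobTitle = listing.select(".job_title span").text();
            String timeAdded = listing.select(".job_date_added [datetime]").attr("datetime");
            lines.add(jobTitle + "\t" + timeAdded);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample markup that matches the selectors used in the question.
        String html = "<div class=\"joblisting\">"
                + "<div class=\"job_title\"><span>SOA Architect</span></div>"
                + "<div class=\"job_date_added\"><time datetime=\"2014-07-24\">today</time></div>"
                + "</div>";
        for (String line : describe(html)) {
            System.out.println(line);
        }
    }
}
```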
Maouven