Using Jsoup in Hadoop

Question

I am trying to parse HTML files for a Hadoop job. I would like to strip all HTML tags from each file and keep only the text. Each file contains several HTML pages that were obtained by a crawler. I tried regular expressions, but they are not a good tool for parsing HTML, so I would like to use JSoup.

Has anyone used JSoup in Hadoop? How did you use -libjars on the command line to get the JSoup jar into the Hadoop VM?

I am using Cloudera's VM. – user3011727, Dec 03 '13 at 06:19

score 1 · Answer 1 · answered Dec 12 '13 at 07:12

I've found it's easiest to use Maven to bundle up all the dependencies into one giant jar for deployment across the cluster.

Basically, follow this: http://maven.apache.org/guides/getting-started/maven-in-five-minutes.html

Then add the plugin that bundles all your dependencies into a single jar. It goes inside the build > plugins section of your pom.xml:

<plugin>
   <artifactId>maven-assembly-plugin</artifactId>
   <version>2.4</version>
   <configuration>
      <descriptorRefs>
         <descriptorRef>jar-with-dependencies</descriptorRef>
      </descriptorRefs>
   </configuration>
   <executions>
      <execution>
         <id>make-assembly</id>
         <phase>package</phase>
         <goals>
            <goal>single</goal>
         </goals>
      </execution>
   </executions>
</plugin>

When you run "mvn package" you'll get a jar in target/ whose name ends in "-jar-with-dependencies.jar". It contains your code plus all your libraries and whatever else they need to run.
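
One thing to watch with this kind of build: Hadoop's own libraries are already on the cluster, so they are commonly declared with "provided" scope to keep them out of the big jar. A rough sketch, assuming the hadoop-client artifact and a hadoop.version property that you would define yourself to match your cluster (neither is from the original answer):

```xml
<!-- Sketch only: goes inside the <dependencies> section of pom.xml. -->
<!-- hadoop.version is a hypothetical property; define it under <properties> to match your cluster. -->
<dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-client</artifactId>
   <version>${hadoop.version}</version>
   <!-- provided: needed to compile, but left out of the jar-with-dependencies -->
   <scope>provided</scope>
</dependency>
```

The jar-with-dependencies descriptor only packs compile and runtime scoped dependencies, so anything marked provided stays out of the assembled jar.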

JSoup is easily available via http://mvnrepository.com/artifact/org.jsoup/jsoup and will work fine within Hadoop. I've used JSoup in Pig UDFs; the only problem I had was that it was by far the most CPU-intensive part of my jobs.
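
For the Maven route, the JSoup dependency would look roughly like this in pom.xml. The groupId and artifactId come from the mvnrepository link above; the version number is only an assumption for illustration, so use whichever release is current for you.

```xml
<!-- Sketch only: goes inside the <dependencies> section of pom.xml. -->
<dependency>
   <groupId>org.jsoup</groupId>
   <artifactId>jsoup</artifactId>
   <!-- Assumed version for illustration; check the link above for the current one. -->
   <version>1.7.2</version>
</dependency>
```

With that on the classpath, Jsoup.parse(html).text() returns the text of a page with the tags stripped, which is what the question is after.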

hey thanks for your answer ... I did use jsoup and here is the resource that helped http://grepalex.com/2013/02/25/hadoop-libjars/ — user3011727, Dec 14 '13 at 01:24
ok, well, I still suggest learning Maven. I started out with that libjars stuff and I've no idea why that blogger says it's "elegant", cf http://stackoverflow.com/questions/11479600/how-do-i-build-run-this-simple-mahout-program-without-getting-exceptions — dranxo, Dec 14 '13 at 01:30
thanks rcompton, it is a goal I have for my JTerm break! do you have a suggestion to a starting place - I have a seven week window to learn — user3011727, Dec 14 '13 at 02:01
