I am trying to parse HTML files in a Hadoop job. I want to strip all HTML tags from each file and keep only the text. Each file contains several HTML pages that were obtained by a crawler. I tried regular expressions, but they are not the best tool for parsing HTML, so I would like to use JSoup.

Has anyone used JSoup in Hadoop? How did you use libjars to get the jar file into the Hadoop VM via the command line?

Stephan

1 Answer

I've found it's easiest to use Maven to bundle up all the dependencies into one giant jar for deployment across the cluster.

Basically, follow this: http://maven.apache.org/guides/getting-started/maven-in-five-minutes.html

Then add the plugin that bundles all your dependencies into a single jar. It goes in the <build><plugins> section of your pom.xml:

<plugin>
   <artifactId>maven-assembly-plugin</artifactId>
   <version>2.4</version>
   <configuration>
      <descriptorRefs>
         <descriptorRef>jar-with-dependencies</descriptorRef>
      </descriptorRefs>
   </configuration>
   <executions>
      <execution>
         <id>make-assembly</id>
         <phase>package</phase>
         <goals>
            <goal>single</goal>
         </goals>
      </execution>
   </executions>
</plugin>

When you run "mvn package" you'll get a jar in the target directory whose name ends in "-jar-with-dependencies.jar". It contains your code plus all your libraries and whatever else they need to run.

JSoup is easily available via http://mvnrepository.com/artifact/org.jsoup/jsoup and will work fine within Hadoop. I've used JSoup in Pig UDFs; the only problem I had was that it was by far the most CPU-intensive part of my jobs.
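
For reference, pulling JSoup in through Maven is a single entry in the <dependencies> section of pom.xml. This is only a minimal sketch: the version number shown is an assumption, so check the repository page linked above for the release you actually want.

```xml
<!-- goes inside <dependencies> in pom.xml -->
<dependency>
   <groupId>org.jsoup</groupId>
   <artifactId>jsoup</artifactId>
   <!-- version is an assumption; pick one listed on the repository page -->
   <version>1.7.2</version>
</dependency>
```

With the assembly plugin configured as above, the JSoup classes should end up inside the jar-with-dependencies, so nothing extra has to be shipped to the cluster.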

dranxo
  • Hey, thanks for your answer. I did use JSoup, and here is the resource that helped: http://grepalex.com/2013/02/25/hadoop-libjars/ – user3011727 Dec 14 '13 at 01:24
  • ok, well, I still suggest learning Maven. I started out with that libjars stuff and I've no idea why that blogger says it's "elegant", cf. http://stackoverflow.com/questions/11479600/how-do-i-build-run-this-simple-mahout-program-without-getting-exceptions – dranxo Dec 14 '13 at 01:30
  • Thanks rcompton, it is a goal I have for my JTerm break! Do you have a suggestion for a starting place? I have a seven-week window to learn. – user3011727 Dec 14 '13 at 02:01
  • follow the link in the answer – dranxo Dec 14 '13 at 03:32