I've found it's easiest to use Maven to bundle up all the dependencies into one giant jar for deployment across the cluster.
Basically, follow this: http://maven.apache.org/guides/getting-started/maven-in-five-minutes.html
Then, add the plugin that packages all your dependencies into a single jar:
<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>2.4</version>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
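(If you're starting from the pom.xml generated by that guide, this snippet goes inside the <build><plugins> section.)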
When you run "mvn package" you'll get a jar in the target directory ending in "-jar-with-dependencies", which contains all your libraries and whatever else they need to run.
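That's the jar you ship to the cluster: run it with "hadoop jar", or REGISTER it in your Pig script.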
JSoup is easily available via http://mvnrepository.com/artifact/org.jsoup/jsoup and will work fine within Hadoop. I've used JSoup in Pig UDFs; the only problem I had was that it was by far the most CPU-intensive part of my jobs.
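For reference, here's a rough sketch of what that can look like as a Pig UDF (the class name and the title-extraction logic are just an example, not what I actually ran). Add the org.jsoup:jsoup dependency to your pom and it gets rolled into the fat jar along with everything else:

// Minimal sketch of a Pig EvalFunc that uses JSoup to pull the <title>
// out of an HTML string. Names here are made up for illustration.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ExtractTitle extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Guard against empty/null tuples, which Pig will happily hand you.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String html = input.get(0).toString();
        // Jsoup.parse is the expensive call -- in my experience the HTML
        // parsing is what dominated the CPU time of the job.
        Document doc = Jsoup.parse(html);
        return doc.title();
    }
}

In the Pig script you just REGISTER the -jar-with-dependencies jar and call the UDF like any other function.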