
Recently, I have wanted to parse websites, use BeautifulSoup to filter out the content I need, and write it to a CSV file in HDFS.

Right now I am at the stage of filtering the page source with BeautifulSoup.

I want to run it as a MapReduce streaming job:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.0.2.jar \
    -mapper /pytemp/filter.py \
    -input /user/root/py/input/ \
    -output /user/root/py/output40/

The input file contains one key-value pair per line: (key, value) = (url, content)

By content, I mean HTML source like:

<html><head><title>...</title></head><body>...</body></html>
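For reference, a single input line would then look roughly like this (hypothetical URL, HTML truncated):

http://example.com/page1,<html><head><title>Page 1</title></head><body>...</body></html>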

filter.py file:

#!/usr/bin/env python
#coding:utf-8
from bs4 import BeautifulSoup
import sys

# read (url, content) pairs from stdin, one per line
for line in sys.stdin:
    line = line.strip()
    key, content = line.split(",")

    # if the following two lines do not exist, the program executes successfully
    soup = BeautifulSoup(content)
    output = soup.find()

    print("Start-----------------")
    print("End------------------")

By the way, I do not think I need a reduce.py for this job.

However, I got this error message:

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

Here is a reply saying it is a memory issue, but my input file is only 3 MB: http://grokbase.com/t/gg/rhadoop/13924fs4as/972-getting-error-pipemapred-waitoutputthreads-while-running-mapreduce-program-for-40mb-of-sizedataset

I have no idea what is causing my problem. I have searched a lot, but it still does not work.

My environment is:

  1. CentOS 6
  2. Python 2.7
  3. Cloudera CDH5

I would appreciate any help with this situation.

EDIT on 2016/06/24

First of all, I checked the error log and found that the problem was too many values to unpack (thanks also to @kynan's answer).

Here is an example of why it happened:

<font color="#0000FF">
  SomeText1
  <font color="#0000FF">
    SomeText2
  </font>
</font>

If part of the content looks like the above and I match font tags with soup.find_all("font", color="#0000FF"), both the outer and the nested tag are returned, so unpacking the result into a single output raises the error too many values to unpack.
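A minimal standalone sketch of the unpack failure (same HTML as above, hypothetical variable names):

from bs4 import BeautifulSoup

html = '<font color="#0000FF">SomeText1<font color="#0000FF">SomeText2</font></font>'
soup = BeautifulSoup(html)
tags = soup.find_all("font", color="#0000FF")  # matches the outer AND the nested tag
print(len(tags))                               # 2
(output,) = tags                               # ValueError: too many values to unpack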

Solution

Just change output = soup.find() to (Var1, Var2, ...) = soup.find_all("font", color="#0000FF", limit=AmountOfVar) and it works well :)
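Continuing the sketch above: with limit matching the number of variables on the left-hand side, the unpack succeeds:

(var1, var2) = soup.find_all("font", color="#0000FF", limit=2)
print(var2.get_text())  # SomeText2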

Danny

1 Answer


This error usually means that the mapper process died. To find out why, check the user logs in $HADOOP_PREFIX/logs/userlogs: there is one directory per job, and inside it one directory per container. Each container directory contains a file stderr with the output the process sent to stderr, i.e. the error messages.
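For example, a hypothetical session (directory names are placeholders; the actual IDs are printed when the job starts):

ls $HADOOP_PREFIX/logs/userlogs/
# one directory per job, e.g. application_<id>; inside it, one per container
cat $HADOOP_PREFIX/logs/userlogs/application_<id>/container_<id>/stderr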

kynan
  • Hi. I am having the same issue described above. How do I access the user logs? – chucknor Jul 03 '16 at 19:41
  • For EMR/YARN you can find your logs from the web UI or on the cluster master shell as shown below (your application ID will differ; it is printed when the job starts). There is a lot of output, so redirect it into a file as shown and look for Python stack traces. $ yarn logs -applicationId application_1503951120983_0031 > /tmp/log – gae123 Aug 31 '17 at 22:03