I am using Nutch 2.1 to crawl a site. The problem is that the crawler keeps showing "fetching url ... spinwaiting/active", and because fetching takes so long, the connection to MySQL times out. How can I reduce the number of URLs fetched at a time so that MySQL does not time out? Is there a setting in Nutch where I can say: fetch only 100 or 500 URLs, parse them and store them to MySQL, and only then fetch the next 100 or 500 URLs?

Error message:

Unexpected error for http://www.example.com
java.io.IOException: java.sql.BatchUpdateException: The last packet successfully received from the server was 36,928,172 milliseconds ago.  The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
    at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
    at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:65)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:587)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.output(FetcherReducer.java:663)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:534)
Caused by: java.sql.BatchUpdateException: The last packet successfully received from the server was 36,928,172 milliseconds ago.  The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
    at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
    at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
    ... 5 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: The last packet successfully received from the server was 36,928,172 milliseconds ago.  The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
    at sun.reflect.GeneratedConstructorAccessor49.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
    at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1116)
    at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3364)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1983)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
    at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1980)
    ... 7 more
Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3345)
    ... 13 more

1 Answer

I am using Nutch 2.1 to crawl a site. The problem is that the crawler keeps showing "fetching url ... spinwaiting/active", and because fetching takes so long, the connection to MySQL times out. How can I reduce the number of URLs fetched at a time so that MySQL does not time out?

To reduce the number of concurrent fetches, add the property below to your nutch-site.xml and adjust the value to your needs. Please do not edit nutch-default.xml; instead, copy the property into nutch-site.xml and manage the value there:

  <property>
    <name>fetcher.threads.fetch</name>
    <value>20</value>
  </property>

Regarding the timeout issue, you can also add this property to your nutch-site.xml, with a value matching the loading time you think is needed:

<property>
  <name>http.timeout</name>
  <value>240000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
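Note that http.timeout governs the HTTP fetch side, not the MySQL connection. For the MySQL side, the error message itself suggests the Connector/J property autoReconnect=true. As a sketch (assuming the standard Gora SQL store configuration for Nutch 2.x; your property names and credentials will differ), you could append it to the JDBC URL in conf/gora.properties:

```properties
# gora.properties — hypothetical example; adjust host, database, and
# credentials to your own setup.
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
# autoReconnect=true tells Connector/J to re-open the connection after
# MySQL closes it for exceeding wait_timeout.
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?autoReconnect=true
gora.sqlstore.jdbc.user=nutch
gora.sqlstore.jdbc.password=yourpassword
```

Alternatively (or additionally), you can raise wait_timeout on the MySQL server itself, as the error message recommends.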

Is there a setting in Nutch where I can say: fetch only 100 or 500 URLs, parse them and store them to MySQL, and only then fetch the next 100 or 500 URLs?

Nutch crawls in a cycle of generate/fetch/parse/update steps, repeated for a number of iterations (the 'depth') that you specify in your crawl command. If you would like finer control over your crawl, you can run each step individually, as described in section 3.2 (Using Individual Commands for Whole-Web Crawling) of the tutorial: http://wiki.apache.org/nutch/NutchTutorial. This will point you in the right direction and help you understand exactly what is happening. Do check the status while fetching each segment, so you know how many URLs are being fetched in each one.
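Running the steps individually might look like the sketch below. This is an illustration only: the exact subcommands and flags vary between Nutch versions (the 2.x commands shown here differ from the 1.x ones in the tutorial), so check `bin/nutch` without arguments for your version. The key part is `-topN`, which caps how many URLs each generate/fetch round handles, which is exactly the "fetch 500, store, then fetch the next 500" behaviour you are after:

```shell
# Seed the crawl database with your start URLs (directory of seed files).
bin/nutch inject urls

# One controlled round: generate a batch of at most 500 URLs,
# fetch and parse only that batch, then write results back to storage.
bin/nutch generate -topN 500
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

# Repeat the generate/fetch/parse/updatedb block for each further batch.
```

Because each round flushes its results to MySQL before the next generate, the database connection is exercised regularly instead of sitting idle for the whole crawl.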
