
How can I load a CSV file from HDFS in SystemML DML?

I tried something like:

X = read("hdfs://ip-XXX-XXX-XXX-XXX:9000/SystemML/data/NN_X_100_10.csv");

I have checked that the file actually exists at this HDFS location.

When I run the DML script with:

 $SPARK_HOME/bin/spark-submit ~/Nearest_Neighbour_Search/SystemML/systemml-0.14.0-incubating.jar -f ~/Nearest_Neighbour_Search/SystemML/Task03_NN_SystemML_1000_hdfs.dml

It complains that:

ERROR:/home/ubuntu/Nearest_Neighbour_Search/SystemML/Task03_NN_SystemML_1000_hdfs.dml -- line 1, column 0 -- Read input file does not exist on FS (local mode): hdfs://ip-172-30-4-168:9000/SystemML/data/NN_X_1000000_1000.csv
        at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:367)
        at org.apache.sysml.api.DMLScript.main(DMLScript.java:214)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: org.apache.sysml.parser.LanguageException: Invalid Parameters : ERROR: /home/ubuntu/Nearest_Neighbour_Search/SystemML/Task03_NN_SystemML_1000_hdfs.dml -- line 1, column 0 -- Read input file does not exist on FS (local mode): hdfs://ip-172-30-4-168:9000/SystemML/data/NN_X_1000000_1000.csv
        at org.apache.sysml.parser.Expression.raiseValidateError(Expression.java:549)
        at org.apache.sysml.parser.DataExpression.validateExpression(DataExpression.java:641)
        at org.apache.sysml.parser.StatementBlock.validate(StatementBlock.java:592)
        at org.apache.sysml.parser.DMLTranslator.validateParseTree(DMLTranslator.java:143)
        at org.apache.sysml.api.DMLScript.execute(DMLScript.java:591)
        at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:353)
        ... 10 more

I think the problem is related to local mode, but I do not know how to configure SystemML to support reading from HDFS.

Any suggestion is highly appreciated!

Thanks!

1 Answer


You're right: this is related to local mode, or more specifically, to the default file system implementation (i.e., fs.defaultFS in core-site.xml). There was a bug in SystemML 0.14 (and prior versions, see https://issues.apache.org/jira/browse/SYSTEMML-1664) that caused all local reads to use the configured default FS implementation regardless of the file scheme specified in the path. The Hadoop jars ship with default configurations that use local mode and the local file system implementation.
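
For reference, the default file system is whatever fs.defaultFS in core-site.xml points to; a minimal sketch of an HDFS setting is shown below (the NameNode address is a placeholder matching the question's setup):

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <!-- placeholder NameNode address -->
        <value>hdfs://ip-XXX-XXX-XXX-XXX:9000</value>
      </property>
    </configuration>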

You have two options here:

  1. Upgrade: Since this bug has been fixed in SystemML master (and thus in any upcoming version), you could simply build SystemML from source or use an existing snapshot artifact (https://repository.apache.org/content/groups/snapshots/org/apache/systemml/systemml/1.0.0-SNAPSHOT/systemml-1.0.0-20170818.213422-9.jar).
  2. Workaround: Alternatively, you can put your CSV file into the local file system and simply use a relative or absolute file path in your read statement (see the sketch below).
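
As a sketch of option 2 (the local paths here are placeholders, assuming the file was first copied out of HDFS with hdfs dfs -get):

    # copy the file to the local file system first, e.g.:
    #   hdfs dfs -get /SystemML/data/NN_X_100_10.csv /home/ubuntu/data/
    # then read it via a local absolute (or relative) path
    X = read("/home/ubuntu/data/NN_X_100_10.csv", format="csv");
    print("rows: " + nrow(X));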
mboehm7