
I am using this piece of code to compute recommendations with Spark:

    SparkSession spark = SparkSession
            .builder()
            .appName("SomeAppName")
            .config("spark.master", "local[" + args[2] + "]")
            .config("spark.local.dir",args[4])
            .getOrCreate();
    JavaRDD<Rating> ratingsRDD = spark
            .read().textFile(args[0]).javaRDD()
            .map(Rating::parseRating);
    Dataset<Row> ratings = spark.createDataFrame(ratingsRDD, Rating.class);
    ALS als = new ALS()
            .setMaxIter(Integer.parseInt(args[3]))
            .setRegParam(0.01)
            .setUserCol("userId")
            .setItemCol("movieId")
            .setRatingCol("rating").setImplicitPrefs(true);

    ALSModel model = als.fit(ratings);
    model.setColdStartStrategy("drop");
    Dataset<Row> rowDataset = model.recommendForAllUsers(50);

These are the Maven dependencies needed to make this code work:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.11.8</version>
    </dependency>

Calculating recommendations with this code takes ~70 sec for my data file. It produces the following warnings:

    WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
    WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
    WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
    WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK

I then tried to enable netlib-java by adding this Maven dependency:

    <dependency>
        <groupId>com.github.fommil.netlib</groupId>
        <artifactId>all</artifactId>
        <version>1.1.2</version>
        <type>pom</type>
    </dependency>

To keep this new setup from crashing, I also had to preload OpenBLAS:

    LD_PRELOAD=/usr/lib64/libopenblas.so

It now runs without warnings, but it is slower: the same calculation takes ~170 sec on average. I am running this on CentOS.

Shouldn't it be faster with native libraries? Is it possible to make it faster?

Stepan Yakovenko
  • I was able to reproduce the warnings. However, I get the result, along with all the results in the Spark example (Spark docs), within 8 seconds, and even with `show()` it takes 16 seconds. What parameters are you using for `setMaxIter()` and for the master `"local[" + args[2] + "]"`? I am using `10` and `2` respectively. – Nikhil Jan 04 '19 at 14:02
  • Can you share the dataset? Maybe I am using a smaller one. – Nikhil Jan 04 '19 at 14:06
  • https://drive.google.com/file/d/16a-U43TDUp8_U3oRG30bq08t51HhZbgd/view – Stepan Yakovenko Jan 04 '19 at 16:25
  • You can set maxIter to ~100, for example, to get a long running time. – Stepan Yakovenko Jan 05 '19 at 08:05

1 Answer

  1. First, check your CentOS version: on CentOS 6 the native libraries may not be used. Check this.

  2. As far as I know, the ALS algorithm has been improved since version 2.0; see the Highlights in 2.2.

    The source code from 2.2 is as below:

    [screenshot of the source code]

    So the native libraries do not help here!
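
One way to confirm which BLAS implementation netlib-java actually selected at runtime is to ask it directly. The sketch below is not from the original post: it assumes netlib-java's `com.github.fommil.netlib.BLAS.getInstance()` entry point and calls it through reflection so that it still compiles and runs when the library is not on the classpath. The class name `BlasCheck` and the method `loadedBlas` are made up for illustration.

```java
import java.lang.reflect.Method;

public class BlasCheck {
    // Returns the class name of the BLAS implementation netlib-java selected,
    // or a short note if netlib-java is missing or fails to initialise.
    public static String loadedBlas() {
        try {
            // Looked up reflectively so this file has no compile-time dependency.
            Class<?> blas = Class.forName("com.github.fommil.netlib.BLAS");
            Method getInstance = blas.getMethod("getInstance");
            Object impl = getInstance.invoke(null); // static method, no receiver
            return impl.getClass().getName();
        } catch (ClassNotFoundException e) {
            return "netlib-java not on classpath";
        } catch (ReflectiveOperationException | LinkageError e) {
            return "failed to initialise BLAS: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println("BLAS implementation: " + loadedBlas());
    }
}
```

If this prints `com.github.fommil.netlib.F2jBLAS`, the pure-Java fallback is in use; a class name starting with `Native` means a native library was loaded.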

fansy1990