
I'm still new to Spark. I wrote this code to parse a large list of strings.

I had to use `forEachRemaining` because I have to initialize some non-serializable objects in each partition.

    JavaRDD<String> lines = initRetriever();
    lines.foreachPartition(iter -> {
        // Initialize the non-serializable objects once per partition
        Obj1 obj1 = initObj1();
        MyStringParser parser = new MyStringParser(obj1);
        iter.forEachRemaining(str -> {
            try {
                parser.parse(str);
            } catch (ParsingException e) {
                e.printStackTrace();
            }
        });

        System.out.println("all parsed");
        obj1.close();
    });

I believe Spark is all about parallelism, but this program uses only a single thread on my local machine. Did I do something wrong? Am I missing some configuration? Or does the iterator prevent it from executing in parallel?

EDIT

I have no configuration files for Spark.

This is how I initialize Spark:

    SparkConf conf = new SparkConf()
            .setAppName(AbstractSparkImporter.class.getCanonicalName())
            .setMaster("local");

I run it from my IDE and with the `mvn exec` command.


1 Answer


As @Alberto-Bonsanto indicated, setting the master to `local[*]` triggers Spark to use all available threads (see the Spark documentation on master URLs for more details).

    SparkConf conf = new SparkConf()
            .setAppName(AbstractSparkImporter.class.getCanonicalName())
            .setMaster("local[*]");