
I have deployed 3 Spring Boot apps with Hazelcast Jet embedded. The nodes recognize each other and run as a cluster. I have the following code: a simple read from CSV and write to a file. But Jet writes duplicates to the file sink. To be precise, Jet processes the total number of entries in the CSV multiplied by the number of nodes. So if I have 10 entries in the source and 3 nodes, I see 3 files in the sink, each containing all 10 entries. I want each record to be processed once and only once. Here is my code:

    Pipeline p = Pipeline.create();

    BatchSource<List<String>> source = Sources.filesBuilder("files")
            .glob("*.csv")
            .build(path -> Files.lines(path).skip(1).map(line -> split(line)));

    p.readFrom(source)
            .writeTo(Sinks.filesBuilder("out").build());
    instance.newJob(p).join();
Rajesh
    You should set the `sharedFileSystem` parameter of the file source builder to true. This will make the source partition the files among the source processors. – ali Jul 25 '20 at 18:52
    Gotcha. Thanks Ali. That worked – Rajesh Jul 25 '20 at 19:12
  • In addition to this: suppose I have a huge file on a shared file system, how can I process that CSV file across multiple nodes in parallel? Right now only one node picks up that huge file; the rest of the nodes are idle. – Rajesh Jul 30 '20 at 14:20
  • You can use the `rebalance` feature if your source operates on only one member, as is the case for you since you have only a single file in the directory. The rebalance feature was introduced in 4.2. – ali Aug 01 '20 at 08:25
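To illustrate the last comment: a minimal sketch of how `rebalance` could be applied, assuming Hazelcast Jet 4.2+ (where `rebalance()` is available on a batch stage); `p`, `source`, and the sink are the same objects as in the question's pipeline.

```java
p.readFrom(source)
        // Redistribute items round-robin across all cluster members, so
        // downstream stages run on every node even when only one member
        // reads the single input file (Jet 4.2+).
        .rebalance()
        .writeTo(Sinks.filesBuilder("out").build());
```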

1 Answer


If it's a shared file system, the `sharedFileSystem` attribute on the file source builder must be set to true. Without it, each member assumes the directory is a local copy and reads every file, which is why each node emits all 10 entries.
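A sketch of the question's pipeline with that flag set, assuming the Jet 4.x `Sources.filesBuilder` API; the `split` helper below is a hypothetical stand-in for the one elided in the question.

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;

public class SharedFsJob {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        BatchSource<List<String>> source = Sources.filesBuilder("files")
                .glob("*.csv")
                // Declare that "files" is one shared volume rather than a
                // per-member local directory, so each file is read exactly once.
                .sharedFileSystem(true)
                .build(path -> Files.lines(path).skip(1).map(SharedFsJob::split));

        p.readFrom(source)
                .writeTo(Sinks.filesBuilder("out").build());

        JetInstance instance = Jet.bootstrappedInstance();
        instance.newJob(p).join();
    }

    // Hypothetical stand-in for the question's split(line) helper.
    private static List<String> split(String line) {
        return Arrays.asList(line.split(","));
    }
}
```

With `sharedFileSystem(true)`, Jet partitions the file set across the members instead of letting every member read everything.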

Oliv