
I'm trying to write a matrix arithmetic method using multidimensional arrays (`[verybigrow][2]`). I'm new at this, and I just can't find what I'm doing wrong. I'd really appreciate any help pointing it out.

    try {
        Stream<String> Matrix = Files.lines(Paths.get(file)).parallel();
        String[][] DataSet = Matrix.map(mapping -> mapping.split(",")).toArray(String[][]::new);
        Double[][] distanceTable = new Double[DataSet.length - 1][];

        /* START WANT TO REPLACE THIS MATRIX CALCULATION WITH PARALLEL STREAM RATHER THAN USE TRADITIONAL ARRAY ARITHMETICS START  */

        for (int i = 0; i < distanceTable.length - 1; ++i) {
            distanceTable[i] = new Double[i + 1];
            for (int j = 0; j <= i; ++j) {
                double distance = 0.0;
                for (int k = 0; k < DataSet[i + 1].length; ++k) {
                    double difference = Double.parseDouble(DataSet[j][k]) - Double.parseDouble(DataSet[i + 1][k]);
                    distance += difference * difference;
                }
                distanceTable[i][j] = distance;
            }
        }

        /* END WANT TO REPLACE THIS MATRIX CALCULATION WITH PARALLEL STREAM RATHER THAN USE TRADITIONAL ARRAY ARITHMETICS END  */

    } catch (Exception except) {
        System.out.println(except);
    }

I'd rather not use libraries or anything like that; I'm mostly doing this to learn how it works. Thank you so much in advance. In case you're asking, the data looks like:

4,53
5,63
10,59
9,77
13,49

The Output of data processing should look like this :

[101.0] <- ((4-5)^2) + ((53-63)^2)
[72.0, 41.0] <- ( ((4-10)^2) + ((53-59)^2) ), ( ((5-10)^2) + ((63-59)^2) )
[601.0, 212.0, 325.0]
[97.0, 260.0, 109.0, 800.0]
[337.0, 100.0, 109.0, 80.0, 400.0]
Bayu Dwiyan Satria
  • What is happening that is not what you expect? – OrdoFlammae Oct 28 '19 at 12:19
  • Yes, the program works if run on small data, but I get a memory heap error on bigger data, over 1 million rows. – Bayu Dwiyan Satria Oct 28 '19 at 12:21
  • I can't see `multithreading`, `parallel-processing`, `bigdata` in your code. Please don't add unnecessary tags. – Mushif Ali Nawaz Oct 28 '19 at 12:42
  • Thank you for reviewing my questions. I appreciate it. – Bayu Dwiyan Satria Oct 28 '19 at 12:51
  • First of all: learn about Java naming conventions. Variable and field names go camelCase. Then: don't start with parallel streams. Just use normal streams, if at all. Solve your problem the most straightforward, simple way. And only when that works, and you **understand** all the things you are doing, then start looking into the next steps. And then: please tell us what your matrix calculation is supposed to result in. Don't throw your code at us so that we identify what exactly you try to do there. Tell us. – GhostCat Oct 28 '19 at 13:05
  • What I'm supposed to do is subtraction for each data point inside the stream using a lambda, rather than storing it into a new variable, because that takes more reserved memory. – Bayu Dwiyan Satria Oct 28 '19 at 13:11
  • What is `matrixDistance`? – Nikolas Charalambidis Jun 28 '20 at 18:46
  • `double[][] distanceTable = distanceTable double[DataSet.length - 1][];` doesn't compile. – Nikolas Charalambidis Jun 28 '20 at 18:46
  • @Nikolas Thank you for reply. `matrixDistance` is distance between two vector Point. Distance calculation may be euclidean distance or manhattan distance – Bayu Dwiyan Satria Jun 28 '20 at 18:48
  • How do you define it? We have no clue what "distance between two vector Points" means. Please provide a minimal and valid snippet. – Nikolas Charalambidis Jun 28 '20 at 18:51
  • @Nikolas https://www.geeksforgeeks.org/program-calculate-distance-two-points/ – Bayu Dwiyan Satria Jun 28 '20 at 18:59
  • Do you want me to figure `matrixDistance` out by myself? My point is that I cannot run your code as long as the `matrixDistance` variable remains uninitialized. It's up to you to provide the implementation. Otherwise nobody can run your code and you get no answers. – Nikolas Charalambidis Jun 28 '20 at 19:01
  • You have missed off some code, so what is here does not compile; for example, what is `matrixDistance` set to? But you will have to re-design how your application works: if I assume that `double[][] distanceTable` has one million entries in the first dimension, each a 4-byte reference to an array sized from 1..999,999 in the other, and each of those doubles uses 8 bytes, then you'd need a colossal amount of memory to store just that alone, even without considering the memory for Matrix and DataSet. – DuncG Jun 29 '20 at 15:27
  • When I see that people aim at speeding up some code with parallelization, and then see that this code involves file IO and `double` parsing, I shudder. Even with a single thread and no streams you can do pretty fast matrix stuff with Java, if you do it *right*. With multiple threads, it may be tricky, but http://www.cs.utexas.edu/ftp/techreports/tr95-13.pdf should be a good starter. – Marco13 Jun 29 '20 at 18:26
  • @Marco13 I didn't think so; since Java 8 it supports parallel streams. After browsing some papers and taking a look at Apache projects, especially Apache Spark and Apache Hadoop, they use the same methods to speed up data processing. Unfortunately those projects don't support Java 11+. – Bayu Dwiyan Satria Jun 29 '20 at 18:38
  • \*sigh\* Yeah, sure. Convert your `String DataSet[][]` into a `Double DataSet[][]` first, and then use the *values* for your computation, instead of calling `parseDouble` *millions* of times. You'll see that the computation is ~10 times faster, *without* parallelization. If you want an efficient parallel solution, the current answer is certainly not the best approach either. – Marco13 Jun 29 '20 at 23:46
  • @Marco13 Thank you, of course I'll improve the answer below. Anyway, thanks for the insight. – Bayu Dwiyan Satria Jun 30 '20 at 15:10

2 Answers


I tried changing `matrixDistance` to `distanceTable`. Try moving this code into a different method so you can run it in parallel:

        for(int i = 0; i < distanceTable.length - 1; ++i) {
            distanceTable[i] = new double[i + 1];
            for(int j = 0; j <= i; ++j) {
                double distance = 0.0;
                for(int k = 0; k < DataSet[i+1].length; ++k) {
                    double difference = Double.parseDouble(DataSet[j][k]) - Double.parseDouble(DataSet[i+1][k]);
                    distance += difference * difference;
                }
                distanceTable[i][j] = distance;
            }
        }

I've created this example based on your question.

    public void parallel(String file) {
        ....
        // parsing from csv into matrix 2d Double[][]
        ....
        IntStream
            .range(1, data.length - 1)
            .parallel()
            .forEach(i -> {
                add(euclidian.euclidian(Arrays.copyOf(data, i + 1)), i);
            });
    }

This is a mini version of your algorithm.

    public Double[] euclidian(Double[][] data) {
        Double[] result = new Double[data.length - 1];
        for (int i = 0; i < result.length; i++) {
            result[i] =
                    Math.pow(data[i][0] - data[data.length - 1][0], 2) +
                            Math.pow(data[i][1] - data[data.length - 1][1], 2);
        }

        return result;
    }
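As a quick sanity check of the method above against the sample data (a self-contained sketch; the method is copied here into a hypothetical `EuclidianCheck` class so it can run on its own):

```java
import java.util.Arrays;

public class EuclidianCheck {
    // Copy of the answer's method, made static for a self-contained check.
    static Double[] euclidian(Double[][] data) {
        Double[] result = new Double[data.length - 1];
        for (int i = 0; i < result.length; i++) {
            result[i] = Math.pow(data[i][0] - data[data.length - 1][0], 2)
                      + Math.pow(data[i][1] - data[data.length - 1][1], 2);
        }
        return result;
    }

    public static void main(String[] args) {
        // First three rows of the sample data; distances are measured
        // from the last row (10,59) back to each earlier row.
        Double[][] slice = { {4.0, 53.0}, {5.0, 63.0}, {10.0, 59.0} };
        System.out.println(Arrays.toString(euclidian(slice))); // [72.0, 41.0]
    }
}
```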

And because of parallel execution, you need to add a locking method for inserting data into `distanceTable`.

    private final Object lock = new Object();
    Double[][] distanceTable;

    void add(Double[] data, int index){
        synchronized (lock) {
            distanceTable[index - 1] = data;
        }
    }
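Since each task writes to a distinct row index, an alternative that avoids the explicit lock is to let the ordered stream assemble the array itself. This is a sketch, not part of the original answer; the class and method names are mine, and it uses primitive `double` throughout:

```java
import java.util.stream.IntStream;

public class DistanceRows {
    // Builds the lower-triangular distance table in parallel.
    // mapToObj + toArray preserves encounter order, and each row is
    // produced independently, so no shared-state locking is needed.
    static double[][] distanceTable(double[][] data) {
        return IntStream.range(1, data.length)
                .parallel()
                .mapToObj(i -> {
                    double[] row = new double[i];
                    for (int j = 0; j < i; j++) {
                        double dx = data[j][0] - data[i][0];
                        double dy = data[j][1] - data[i][1];
                        row[j] = dx * dx + dy * dy;
                    }
                    return row;
                })
                .toArray(double[][]::new);
    }

    public static void main(String[] args) {
        double[][] sample = { {4, 53}, {5, 63}, {10, 59}, {9, 77}, {13, 49} };
        System.out.println(java.util.Arrays.deepToString(distanceTable(sample)));
        // [[101.0], [72.0, 41.0], [601.0, 212.0, 325.0], [97.0, 260.0, 109.0, 800.0]]
    }
}
```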

I've tested it on my laptop; for 74 rows in the CSV file the comparison looks like this (ORI is using your code, PAR is using my approach):

    java -jar target/stream-example-1.0-SNAPSHOT.jar test.csv 
    #####################
    ORI read: 59 ms
    ORI  map: 71 ms
    ORI time: 80 ms
    #####################
    PAR read: 0 ms
    PAR  map: 6 ms
    PAR time: 11 ms

Hope it helps.

Fahim Bagar
  • Thank you for reviewing the question. Could you share a runnable code snippet or repository so I could test it myself with my actual data? – Bayu Dwiyan Satria Jun 29 '20 at 18:24
  • Check my [repository](https://github.com/fahimbagar/stream-example). You're welcome. – Fahim Bagar Jun 29 '20 at 18:31
  • Thank you Fahim, I've tested your code, and it works with the [Ruspini Dataset](https://rstudio-pubs-static.s3.amazonaws.com/548130_7f9994cad80b4efe829c260a6bb30019.html). The speed ratio is very large. Thank you very much. Hope this helps my research. – Bayu Dwiyan Satria Jun 29 '20 at 18:41
  • I'm afraid the test results are misleading as your test program only runs the timings once on 7 lines, and JIT will make a massive difference for larger data sets. Whichever of PAR or ORI timings are run first takes the longest time - try swapping over the sections calling `main.parallel(file)` and `main.original(file)`. On my old PC (2 CORE) the ORI code is always quicker than the parallel when you compare the times of the version executed last. It would be interesting to see at which point the parallel version become the fastest with larger test data. – DuncG Jun 30 '20 at 14:58
  • Here are my second run timings: ORI read: 1 ms, ORI map: 4 ms, ORI time: 5 ms compare to PAR read: 0 ms, PAR map: 8 ms, PAR time: 13 ms. – DuncG Jun 30 '20 at 15:12
  • Here is my test: 1049 lines of CSV file. ORI read: 57 ms; map: 78 ms; time: 238 ms. PAR read: 57 ms; map: 92 ms; time: 127 ms. – Fahim Bagar Jun 30 '20 at 15:37

@Fahim Bagar's answer example should run faster with big data sets, but you should improve your single-threaded code before making hasty decisions about timing metrics compared to parallel.

For example, removing the wasteful `Double.parseDouble` calls is easy with the code example provided by @Fahim Bagar, swapping `String[][] DataSet` for `Double[][] DataSet`:

    //String[][] DataSet = Matrix.map(mapping -> mapping.split(",")).toArray(String[][]::new);
    Double[][] DataSet = Matrix.map(row -> Arrays.stream(row.split(",")).map(Double::parseDouble).toArray(Double[]::new)).toArray(Double[][]::new);

Then take the array references for `DataSet[i + 1]` and `DataSet[j]` into local variables outside their loops:

    for (int i = 0; i < distanceTable.length - 1; ++i) {
        Double[] arriplus1 = new Double[i + 1];
        Double[] iarr = DataSet[i + 1];
        for (int j = 0; j <= i; ++j) {
            double distance = 0.0;
            Double[] jarr = DataSet[j];
            for (int k = 0, sz = iarr.length; k < sz; ++k) {
                double difference = jarr[k] - iarr[k];
                distance += difference * difference;
            }
            arriplus1[j] = distance;
        }
        distanceTable[i] = arriplus1;
    }

You can do the same for @Fahim Bagar's `euclidian` method:

    public Double[] euclidian(Double[][] data) {
        Double[] result = new Double[data.length - 1];
        Double[] dL1 = data[data.length - 1];
        for (int i = 0; i < result.length; i++) {
            Double[] di = data[i];
            result[i] = Math.pow(di[0] - dL1[0], 2) + Math.pow(di[1] - dL1[1], 2);
        }
        return result;
    }

After that, getting rid of `Double` and using primitive `double` would speed things up further and cut down on memory allocations.
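As a sketch of that suggestion (not part of the original answer; the class and method names are hypothetical, and it assumes the comma-separated numeric rows of the sample data), parsing straight into a primitive `double[][]` avoids both the boxed wrappers and the repeated `parseDouble` calls in the inner loop:

```java
import java.util.stream.Stream;

public class PrimitiveDistances {
    // Parses CSV lines straight into a primitive double[][] (no boxing).
    static double[][] parse(Stream<String> lines) {
        return lines.map(row -> {
            String[] parts = row.split(",");
            double[] vals = new double[parts.length];
            for (int k = 0; k < parts.length; k++) {
                vals[k] = Double.parseDouble(parts[k]);
            }
            return vals;
        }).toArray(double[][]::new);
    }

    // Same lower-triangular computation as the answer, on primitives only.
    static double[][] distances(double[][] data) {
        double[][] table = new double[data.length - 1][];
        for (int i = 0; i < table.length; i++) {
            double[] row = new double[i + 1];
            double[] iarr = data[i + 1];
            for (int j = 0; j <= i; j++) {
                double distance = 0.0;
                double[] jarr = data[j];
                for (int k = 0; k < iarr.length; k++) {
                    double diff = jarr[k] - iarr[k];
                    distance += diff * diff;
                }
                row[j] = distance;
            }
            table[i] = row;
        }
        return table;
    }
}
```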

On a CSV with 1048 rows I see these timings on the 10th run of each:

    #####################
    ORI read: 0 ms
    ORI  map: 4 ms
    ORI time: 14 ms
    #####################
    PAR read: 0 ms
    PAR  map: 1 ms
    PAR time: 10 ms
DuncG