2

I am relatively new to R. I have a large list of voxel coordinates (> 1 GByte) that I read from a LAMMPS dump file and has this structure (this is a small reproducible example):

> lammps_file
V1  V2 V3 V4 V5
1  78  1  1  1
2 163  2  1  1
3 157  3  1  1
4  79  4  1  1
5 238  1  2  1
6 145  2  2  1
7 103  3  2  1
8 108  4  2  1
9 254  1  3  1
10  85  2  3  1
11 219  3  3  1
12 214  4  3  1
13 109  1  4  1
14 237  2  4  1
15 145  3  4  1
16 118  4  4  1
17  28  1  1  2
18 134  2  1  2
19 174  3  1  2
20   2  4  1  2
21 210  1  2  2
22 219  2  2  2
23 138  3  2  2
24 219  4  2  2
25 231  1  3  2
26  53  2  3  2
27 255  3  3  2
28 255  4  3  2
29 157  1  4  2
30 188  2  4  2
31 143  3  4  2
32  85  4  4  2
33  71  1  1  3
34  80  2  1  3
35 164  3  1  3
36 142  4  1  3
37 144  1  2  3
38 194  2  2  3
39 173  3  2  3
40  55  4  2  3
41 115  1  3  3
42   9  2  3  3
43  45  3  3  3
44 103  4  3  3
45  53  1  4  3
46  87  2  4  3
47  37  3  4  3
48  88  4  4  3
49 176  1  1  4
50 127  2  1  4
51   7  3  1  4
52 123  4  1  4
53  97  1  2  4
54   5  2  2  4
55 216  3  2  4
56  37  4  2  4
57  52  1  3  4
58  50  2  3  4
59  63  3  3  4
60 231  4  3  4
61 164  1  4  4
62 101  2  4  4
63 137  3  4  4
64 197  4  4  4

with the x, y, and z voxel coordinates in columns 3:5 and the voxel intensity value in column 2. Column 1 is not used. My goal is to write the voxel intensity values of all (n) coordinates xyz into an array. I am doing this to use the array structure and easily convert it into a stack of tiff images (with EBImage)... For this purpose I use the following code

> lcoordinates <- read.table(lammps_file, ...)
> n <- dim(lcoordinates)

> varray <- array(0, c(number_voxels_in_x, number_voxels_in_y, number_voxels_in_z))
>
>   for (j in 1:n[1]) {
>        v <-  as.numeric(lcoordinates[j,2:5])
>        varray[v[2], v[3], v[4]] <- v[1]
>   }

The result should look like this:

> varray

, , 1

     [,1] [,2] [,3] [,4]
[1,]   78  238  254  109
[2,]  163  145   85  237
[3,]  157  103  219  145
[4,]   79  108  214  118

, , 2

     [,1] [,2] [,3] [,4]
[1,]   28  210  231  157
[2,]  134  219   53  188
[3,]  174  138  255  143
[4,]    2  219  255   85

, , 3

     [,1] [,2] [,3] [,4]
[1,]   71  144  115   53
[2,]   80  194    9   87
[3,]  164  173   45   37
[4,]  142   55  103   88

, , 4

     [,1] [,2] [,3] [,4]
[1,]  176   97   52  164
[2,]  127    5   50  101
[3,]    7  216   63  137
[4,]  123   37  231  197

It works, but takes an enormous amount of time (> 3h for 1*10^9 voxels or rows). Is there a way to code this faster? Thanks!

Martin R.
  • 21
  • 2
  • 1
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. We don't need your actual data, just something to test with. – MrFlick Mar 10 '20 at 15:31
  • Try instead the for loop maybe something like: `varray[as.matrix(lcoordinates[3:5])] <- lcoordinates[,2]` . – GKi Mar 10 '20 at 15:48
  • without information of what will you do afterwards with the data I am not sure if it would be possible, but I would suggest to not use your desired structure, because it probably will be slower and harder to process the data – minem Mar 11 '20 at 07:27
  • @MrFlick: I gave a simple example with input and output data, hope this helps... – Martin R. Mar 11 '20 at 16:14
  • 1
    If your data is sorted like that, then you can just do `array(lammps_file$V2, c(4,4,4))` and the array will be created in the correct order. – MrFlick Mar 11 '20 at 16:18
  • @GKi: tried your approach like this `varray[as.matrix(unlist(lcoordinates[,3:5]))] <- lcoordinates[,2]` but what happens is that only the first column of `varray` is filled? – Martin R. Mar 11 '20 at 16:19
  • @MrFlick I will certainly try that! – Martin R. Mar 11 '20 at 16:20
  • @All Thank you very much for your help on my first R question here. Very much appreciated! – Martin R. Mar 12 '20 at 09:25

2 Answers2

0

You can use a matrix index for subsetting like:

varray <- array(0, c(max(lcoordinates[,3]), max(lcoordinates[,4])
   , max(lcoordinates[,5])))
varray[as.matrix(lcoordinates[3:5])] <- lcoordinates[,2]
varray
, , 1

     [,1] [,2] [,3] [,4]
[1,]   78  238  254  109
[2,]  163  145   85  237
[3,]  157  103  219  145
[4,]   79  108  214  118

, , 2

     [,1] [,2] [,3] [,4]
[1,]   28  210  231  157
[2,]  134  219   53  188
[3,]  174  138  255  143
[4,]    2  219  255   85

, , 3

     [,1] [,2] [,3] [,4]
[1,]   71  144  115   53
[2,]   80  194    9   87
[3,]  164  173   45   37
[4,]  142   55  103   88

, , 4

     [,1] [,2] [,3] [,4]
[1,]  176   97   52  164
[2,]  127    5   50  101
[3,]    7  216   63  137
[4,]  123   37  231  197

In case it is already correct sorted @MrFlick answer in the comment is faster.

GKi
  • 37,245
  • 2
  • 26
  • 48
  • Thank you very much, your solution also works even if coordinates with 0 voxel intensity are omitted (not explicitely listed) in the `lammps_file`... – Martin R. Mar 12 '20 at 09:24
0

If your data is already organised in the order it needs to be, then by far the quickest way is to use dim<-. In your example:

## sort into correct order (maybe this already happened)
lammps_file <- lammps_file[order(lammps_file$V3,lammps_file$V4,lammps_file$V5)]
## create a new vector
y <- lammps_file$V2
## now just assign it a dimension
dim(y) <- c(n1,n2,n3)

(if your array comes out transposed then you might need V3, V4 and V5 in a different order but the approach will work). There is negligible computational cost in setting a dim attribute no matter how big the data is. There is a huge cost in your example because you are looping over rows. In R, looping over the rows of a dataset is almost never the quickest way to approach a problem, you should be using vectors instead.

JDL
  • 1,496
  • 10
  • 18