
I have an array of the form {{int, int, real, ..., string, real, ...}, ...} with dimensions of approximately 1,000,000 x 400.

My goal is to minimize the time that it takes to update a large number of selective values in this array.

If the values were adjacent, I could do something like

arr[[...]] = ParallelMap[updateFunc, arr[[...]]]

but Part[] doesn't take scattered position specs the way, say, Extract[] does. So arr[[{{1,2},{5,7},...}]] is not an option (it does something entirely different), and assigning to Extract doesn't place values back into the array. Believe me, against my better judgement, I've tried: Set::write: "Tag Extract in Extract[{1,2,3,4,5},{{1},{3},{5}}] is Protected.".
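To illustrate the mismatch on a toy array (not my real data): Extract reads scattered positions fine, while Part with index lists selects a rectangular cross-section instead.

(* toy example: Extract reads scattered cells, Part takes a rectangular block *)
toy = Partition[Range[20], 5];
Extract[toy, {{1, 2}, {3, 4}}]  (* -> {2, 14}: the two scattered cells *)
toy[[{1, 3}, {2, 4}]]           (* -> {{2, 4}, {12, 14}}: a 2x2 sub-block *)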

I tried SetSharedVariable[arr] and then using ParallelMap around the individual updates, but holy cow, is using shared variables time-consuming!
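For reference, the pattern I tried looked roughly like this (positions standing for a hypothetical list of the {row, col} pairs to touch); every assignment round-trips through the main kernel:

(* positions: hypothetical list of {row, col} pairs to update *)
SetSharedVariable[arr];
ParallelMap[(arr[[Sequence @@ #]] = updateFunc[arr[[Sequence @@ #]]]) &, positions];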

I finally settled on the fastest method I've found, which is

arr = ParallelTable[updateFunc[row], {row, arr}];

It is still painfully slow, and I know there's a better way than to (a) retouch every value and (b) create a whole new temp table in memory.

Help please!

Leonid Shifrin
Gregory Klopper
  • Could you provide some more information about what "update" actually means? For a list of this size I would be trying to figure out a way of using `Part` if at all possible. `Replace` and similar functions will rewrite the list and become slow. Is there a way to identify the positions that get "updated"? You could `Map` (`ParallelMap`) the positions onto `Part`: `(arr[[#]] = updateFunc[arr[[#]]]) & /@ positions` – Mike Honeychurch Jan 07 '12 at 06:13
  • What exactly is your selection criteria? – halirutan Jan 07 '12 at 15:21
  • As an active member of the Mathematica tag, have you considered committing to the [area51.se] [Mathematica proposal](http://area51.stackexchange.com/proposals/37304/mathematica?referrer=DamSFi3dv5QIDM_9uBjtlA2)? We could use your help. – rcollyer Jan 08 '12 at 15:16
  • One example: any strings containing ":" are times and need to be turned into scaled reals. As Leonid points out below, they happen to fall into columns since this is a rectangular array, so I can use Part to change a column at a time. Before, I was doing `Position[arr, s_/;StringQ@s&&!StringFreeQ[s,":"],Heads->False]` – Gregory Klopper Jan 08 '12 at 15:19
  • ...of course it would still be nice to do jagged/sparse Part extractions, that do not have to be rectangular e.g. arr[[{{2;;6,9,25},{{5;;8,12},10,30}}]] would extract columns 5-8 and 12 for rows 2-6, and column 10 for row 9 and column 30 for row 25. Or however it would actually work, but it would be nice. :-) – Gregory Klopper Jan 08 '12 at 15:22
  • Greg, thank you for committing to the proposal. Your SO account and Area51 account aren't linked together. To reach beta, we need 100 experienced committers (200+ rep on some SE site) and 200 overall, so by linking the accounts, you help us twice. – rcollyer Jan 08 '12 at 16:36
  • You're welcome - it's a good initiative. – Gregory Klopper Jan 08 '12 at 19:56

4 Answers


The fastest way to do this that I could think of is to pre-process the list of positions, grouping together positions in the same column, and then update column-by-column with Part. This uses the fact that your array is rectangular (not ragged). Here is the code:

ClearAll[updateByColumn];
SetAttributes[updateByColumn, HoldFirst];
updateByColumn[l_, positions_, updateFunc_, updateFuncListable : (True | False) : False] :=
  MapThread[
    (l[[##]] = If[updateFuncListable, updateFunc@l[[##]], updateFunc /@ l[[##]]]) &,
    {#[[All, All, 1]], #[[All, 1, -1]]} &@GatherBy[positions, Last]];
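To see what the pre-processing stage produces, here is its output for a small hand-made list of {row, column} positions ({1, 2} and {3, 2} share column 2, so their rows get grouped together):

positions = {{1, 2}, {4, 1}, {3, 2}};
{#[[All, All, 1]], #[[All, 1, -1]]} &@GatherBy[positions, Last]
(* -> {{{1, 3}, {4}}, {2, 1}} : row groups paired with their columns *)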

EDIT

This assumes that the updating does not depend on the previously updated values. If it does, one can write a more elaborate version of this code which would take that into account but will perhaps be somewhat slower.

END EDIT
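For illustration, a minimal sketch of such an order-respecting variant (updateInOrder is a hypothetical name, not part of the original answer); it touches one cell at a time in the listed order, and so will be slower:

ClearAll[updateInOrder];
SetAttributes[updateInOrder, HoldFirst];
(* apply updateFunc cell by cell, in the order the positions are listed *)
updateInOrder[l_, positions_, updateFunc_] :=
  Scan[(l[[Sequence @@ #]] = updateFunc@l[[Sequence @@ #]]) &, positions];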

Here is a small test example to see how it works:

randomString[] := FromCharacterCode@RandomInteger[{97, 122}, 5];

In[131]:= 
len = 10;
poslen = 10;
n = 1;
m = 1;
tst = 
  Table[{
     Sequence @@ RandomInteger[10000, n],
     Sequence @@ Table[randomString[], {m}],
     Sequence @@ RandomReal[10000, n]}, {len}
]
testPositions =
  Table[{RandomInteger[{1, Length[tst]}], RandomInteger[{1, Length@First@tst}]},
    {poslen}]

Out[135]= {{320, "iwuwy", 3082.4}, {3108, "utuwf", 4339.14}, {5799, "dzjht", 8650.81}, 
{3177, "biyyl", 6239.64}, {7772, "bfawf",  6704.02}, {1679, "lrbro", 1873.57}, 
{9866, "gtprg", 4157.83}, {9720, "mtdnx", 4379.48}, {5399, "oxlhh", 2734.21}, 
{4409, "dbnlx",  955.428}}

Out[136]= {{1, 2}, {4, 1}, {3, 2}, {7, 2}, {8, 1}, {5, 2}, {2, 2},
{7, 2}, {2, 2}, {6, 2}}

Here we call the function:

In[137]:= 
updateByColumn[tst, testPositions, f];
tst

Out[138]= {{320, f["iwuwy"], 3082.4}, {3108, f["utuwf"], 4339.14}, 
{5799, f["dzjht"], 8650.81}, {f[3177], "biyyl", 6239.64}, {7772, f["bfawf"], 6704.02},
{1679, f["lrbro"], 1873.57}, {9866, f["gtprg"], 4157.83}, {f[9720], "mtdnx", 4379.48}, 
{5399, "oxlhh", 2734.21}, {4409, "dbnlx", 955.428}}

Note that, since the function has the HoldFirst attribute, the original array is modified in place, which saves the memory that would otherwise be needed for a copy.

Now, generating the large sample with the same code as above, but with these values of the parameters: len = 100000; poslen = 50000; n = 100; m = 100;, the call updateByColumn[tst, testPositions, f]; runs in 0.15 s on my machine, and that's without parallelization. If your updating function updateFunc is Listable and that makes it much faster, you can set the optional last argument to True to make it run potentially even faster.
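For instance, a hypothetical Listable update function (a vectorized scale-and-shift; numericPositions is an assumed list of positions pointing only at numeric cells):

(* numericPositions: hypothetical positions of numeric cells only *)
scaleAndShift = Function[x, (x - 5000.)/10000., Listable];
updateByColumn[tst, numericPositions, scaleAndShift, True];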

You can employ more tricks to save on time/memory consumption. For example, if you know that certain columns of your original large array are filled only with a certain packable numeric type (Integer, Real, or Complex), you can Map Developer`ToPackedArray over those specific columns to significantly reduce the memory occupied by your array. The code to pack the array would be:

tstPacked = Table[0, {Length@First@tst}];
Do[tstPacked[[i]] = Developer`ToPackedArray[tst[[All, i]]], {i, Length@First@tst}];

If, e.g., you produce tst with the above code and the parameters len = 100000; poslen = 50000; n = 100; m = 10;, applying ByteCount gives 700800040 bytes for the array tst, but only 182028872 bytes for tstPacked. (Note that an attempt to Transpose, then Map Developer`ToPackedArray, and then Transpose again will fail, since the second Transpose would unpack all the columns.) Note also that the columns will remain packed only if your updateFunc produces values of the same type as the original column elements, for each column type.
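A quick way to check whether a given column is still packed after an update:

Developer`PackedArrayQ@tstPacked[[1]]  (* True if column 1 stayed packed *)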

On top of this, you can probably change the MapThread to code using, say, ParallelMap, to leverage parallel capabilities.
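A hedged sketch of what that could look like, given the caveat raised in the comments below (assignments made on subkernels do not reach the main kernel's copy, so the write-back has to stay serial; distributing a large tst to the subkernels also has its own cost):

(* sketch only: compute the new column slices in parallel, write back serially *)
pairs = Transpose[{#[[All, All, 1]], #[[All, 1, -1]]} &@GatherBy[testPositions, Last]];
newSlices = ParallelMap[f /@ tst[[Sequence @@ #]] &, pairs];
MapThread[(tst[[Sequence @@ #1]] = #2) &, {pairs, newSlices}];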

I am a bit worried about the dimensions you describe for the full array: it might not fit in memory. But I guess that is another problem.

Leonid Shifrin
  • Leonid, I was hoping you'd answer this - you're the brain! :-) I'm already using PackedArray for the final output since I strip strings and convert them to numeric values (reason for the update). No, I'm not using updated values in other updates - I'm processing a large training set for a NN and SVM set of algorithms, so I simply need to quantify strings and scale/shift some large ints. – Gregory Klopper Jan 07 '12 at 16:04
  • As for parallelization, I've not tried your solution yet, though it seems it would work, but are you sure that ParallelMap will work instead of MapThread? In your example `l[[##]] =` isn't a shared variable, so ParallelMap won't return the results to the main kernel, unless I'm missing something. – Gregory Klopper Jan 07 '12 at 17:15
  • @Gregory I see (packed arrays). You are probably right that the parallel version won't work because we are mutating the state. If the fraction of mutated elements to all elements is large, you can consider creating a copy rather than changing the original list - in which case you probably can parallelize to create parts of that by different kernels. But, in my tests, the code was relatively fast without parallelization - scaling to your real set, it may take a second or two on a fast processor (I don't know whether or not this is acceptable for you). As a final resort, you could use C. – Leonid Shifrin Jan 07 '12 at 18:23
  • 1
    @Gregory But, given the size of your data, I'd consider writing a framework to store parts of that on disk in some efficient format (like .mx), and load pieces automatically on demand. It is not the first time that we find such functionality very desirable, here at SO. Have a look at this answer of mine:http://stackoverflow.com/questions/8247005/efficiently-working-with-and-generating-large-text-files/8250860#8250860, for something along these lines (although that answer presents a point solution, and there is still a long way to go from that to what can be called a framework). – Leonid Shifrin Jan 07 '12 at 18:26
  • That's probably a very good solution too: take its sweet time to pre-process, and store in a data dump file. Although it would have been nice to assign to sparse elements in the array all at once, just like Part[] can do to continuous blocks of memory, i.e. arr[[list_of_elements]] = f /@ arr[[list_of_elements]]. Thanks a lot, Leonid, for presenting me with some really good alternatives for solutions to this problem. – Gregory Klopper Jan 08 '12 at 05:01
  • Leonid, I am almost certainly doing something wrong, but your code seem extremely slow on my system. Any comment? – Mr.Wizard Jan 13 '12 at 00:15
  • @Mr.Wizard You are right, I seem to have made a mistake when refactoring from rows (which I used initially) into columns. The short story is that the `GatherBy` line should be rather like `{#[[All, All, 1]], #[[All, 1, -1]]} &@GatherBy[positions, Last]`. I have no time to finish this today, but will look at it tomorrow and edit accordingly. Thanks for spotting this! – Leonid Shifrin Jan 13 '12 at 21:01

Will check back tomorrow for more information from you, but if you have a way of identifying which positions you want to "update", then how about

(arr[[#]] = updateFunc[arr[[#]]]) & /@ positions

and

ParallelMap[(arr[[#]] = updateFunc[arr[[#]]]) &, positions]

This assumes that your updating depends on the previous values -- which seems to be the case from your comment to Nasser's answer -- and that you know the positions that must be updated. I think replacement rules will be slow for lists this size, so Part seems preferable.
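Note that # here is a whole {row, col} pair (see the first comment below), so the Part specification needs splicing, along the lines of:

(* splice the pair into Part; indexing with the pair itself would select two whole rows *)
(arr[[Sequence @@ #]] = updateFunc[arr[[Sequence @@ #]]]) & /@ positions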

Mike Honeychurch
  • I've actually done this before. Only arr[[#]] will receive arr[[{3,5}]], which is obviously two rows and not a single value, so I did arr[[First@#, Last@#]]; this process ended up significantly slower than arr = ParallelTable[updateFunc@row, {row, arr}, Method -> "CoarsestGrained"]; Wherever possible, you want to take advantage of set-based updates, and your method would update (albeit in parallel) one value at a time. – Gregory Klopper Jan 07 '12 at 15:29

Unless I misunderstood you, maybe you can try ReplacePart.

(*make data once *)
$mat = Table[Random[], {3}, {3}]

which is

{{0.295376, 0.362912, 0.945531}, 
 {0.191438, 0.175706, 0.469595}, 
 {0.734491, 0.328592, 0.856225}}

First I use Map to apply ReplacePart, replacing some parts of the matrix by zero.

mat = $mat;
pos = Position[mat, x_ /; x < .5]
(*---> {{1, 1}, {1, 2}, {2, 1}, {2, 2}, {2, 3}, {3, 2}, {3, 3}} *)

now use Map

mat = Map[ReplacePart[#[[1]], #[[2]]] &, {{mat, pos -> 0.}}];
mat

gives

{{{0., 0., 0.945531}, {0., 0., 0.}, {0.734491, 0., 0.856225}}}

Now do the same using ParallelMap:

mat = $mat;
mat = ParallelMap[ReplacePart[#[[1]], #[[2]]] &, {{mat, pos -> 0.}}];
mat

and it gives the same result:

{{{0., 0., 0.945531}, {0., 0., 0.}, {0.734491, 0., 0.856225}}}

edit(1)

Well, I tried this:

Using Map only, first:

$mat = {{0.295376, 0.362912, 0.945531}, {0.191438, 0.175706, 
    0.469595}, {0.734491, 0.328592, 0.856225}};
mat = $mat;
pos = Position[mat, x_ /; x < .5];
Map[(mat = ReplacePart[mat, # -> 0]) &, pos];
mat

gives

{{0, 0, 0.945531}, {0, 0, 0}, {0.734491, 0, 0.856225}}

But when I use ParallelMap, it does not update the matrix for some reason:

mat = $mat;
ParallelMap[(mat = ReplacePart[mat, # -> 0]) &, {{1, 1}}];
mat

mat is the same as before. I am not sure why; if I can figure it out, I will update, for this is the best I have for now. Good luck.

Nasser
  • Not a bad idea at all, but how do I actually use the old value at the position which is being replaced to create the new value? E.g., in terms of your example, if the value is x < .5, how would I add .25 to the old values? – Gregory Klopper Jan 07 '12 at 04:50
  • ...one more thing... Aren't you actually not running anything at all in parallel? You send ` {{mat, pos -> 0.}}` to ParallelMap, which maps ReplacePart over exactly 1 element of this List[List[map,Rule[]]]. So you're not parallelizing the replace part, only the act of starting it, which is done once. – Gregory Klopper Jan 07 '12 at 04:53
  • see in my original post why ParallelMap doesn't update - arr is not a Shared Variable. To make the update work, the update should create a value list in parallel and put it in the desired cells all at once outside of parallelism, otherwise it attempts to re-distribute the updated value of arr to all kernels, which takes a lot of overhead. – Gregory Klopper Jan 07 '12 at 15:31

You may find utility in this construct:

update = ReplacePart[#, Thread[#2 -> #3 /@ Extract[#, #2]]] &;

Use:

table = Array[Times, {7, 7}];

parts = {{5, 1}, {7, 7}, {5, 2}, {4, 6}, {2, 3}, {4, 7}};

update[table, parts, Framed] // Grid

(Image: the 7×7 times table rendered as a Grid, with the six listed entries framed.)
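Since update returns a new list rather than modifying its argument, the in-place idiom from the question becomes a reassignment (arr, positions, and updateFunc as in the question):

(* write the updated copy back over the original symbol *)
arr = update[arr, positions, updateFunc];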

Mr.Wizard