
I have a ConcurrentDictionary of arrays, where each array has the same fixed size. It looks like this: `ConcurrentDictionary<int, double[]> ItemFeatures`

I want to normalize the values by dividing every value by the maximum of the values in that position across all the arrays. For example, if my arrays are of size 5, I want every element in the first position to be divided by the maximum of all the values in that position, and so on for position 2 onwards.

The naive way that I can think of doing this is to first iterate over every array and every element in it, storing the max value per position, and then iterate over everything again, dividing each element by the previously found maximum for its position.
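
Roughly, this is what I have in mind (just a sketch; it assumes every array has the same fixed length and seeds the maxima with `double.MinValue` rather than 0, so all-negative columns don't lead to a divide by zero):

var cols = ItemFeatures.First().Value.Length;
var max = new double[cols];
for (var j = 0; j < cols; j++)
    max[j] = double.MinValue;

// First pass: record the largest value seen at each position.
foreach (var row in ItemFeatures.Values)
    for (var j = 0; j < cols; j++)
        if (row[j] > max[j])
            max[j] = row[j];

// Second pass: divide every element by its position's maximum.
foreach (var row in ItemFeatures.Values)
    for (var j = 0; j < cols; j++)
        row[j] /= max[j];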

Is there a more elegant way to do this, perhaps with LINQ? These dictionaries would be large, so the more efficient and less time-consuming, the better.

  • LINQ wouldn't make the code do any less work; at best it would save you from writing as much code to do the same thing. You're *going* to need to iterate through all of the data to find the max value at each position; you can't find the max value without looking at every value, there's simply no way around that. – Servy Jun 01 '17 at 18:11
  • Are any other threads going to be accessing the `ConcurrentDictionary` while you're doing this? If so, the problem is likely unsolvable. – Servy Jun 01 '17 at 18:13
  • @Servy Obviously Linq cannot perform miracles. My question is whether that would somehow be more efficient (in terms of memory allocation or time), in the same way that vectorization in MATLAB is more efficient than loops. – Antimony Jun 01 '17 at 18:18
  • No, nothing would be accessing it at the same time. – Antimony Jun 01 '17 at 18:18
  • LINQ is a convenient way of writing certain queries. Anything you can do with it you can do without it, but the code without it might be longer or, more likely, convey the meaning less clearly. Not all LINQ operations are loops under the hood (although a lot of them are little more than a loop) but they *are* mostly all reasonably straightforward iterative processes that people write without LINQ all the time. – Servy Jun 01 '17 at 18:24
  • Why have a ConcurrentDictionary in the first place if multiple threads aren't accessing it at the same time? Sounds like you should just be using a regular `Dictionary`. – Servy Jun 01 '17 at 18:26
  • @Servy It could be accessed by multiple threads later on, after the normalization. – Antimony Jun 01 '17 at 18:29

2 Answers


No, that will actually be the most efficient way. In the end, this is what you need to do anyway; you can't skip anything. You can probably write it in LINQ somehow, but the performance will be worse because it will involve a lot of extra function calls and memory allocations. LINQ doesn't perform miracles; it's just a (sometimes) shorter way of writing things.

What can speed this up is if your algorithm has good "cache locality" - in other words, if it accesses memory in a sequential way. That's pretty hard to guarantee in an environment like .NET, but a loop like you described probably has the best chance of getting close to it.

Vilx-
  • I was hoping something similar to vectorization existed in C# that could make things faster than loops. – Antimony Jun 01 '17 at 18:23
  • What is Vectorization? :) – Vilx- Jun 01 '17 at 20:21
  • [What is "Vectorization"?](https://stackoverflow.com/questions/1422149/what-is-vectorization) – Antimony Jun 01 '17 at 21:43
  • Ahh, I see. Hmm... ahh, here: https://msdn.microsoft.com/en-us/library/dn879696(v=vs.110).aspx Looks like support for that was finally added in 4.6. Heh, that's something I didn't know until now! But you have to be explicit about it, there's nothing automatic. Well, maybe you can take advantage of that! Although I really don't know how big the performance gain would be for something like this. If you say that you have lots and lots of data then I suspect that CPU cache misses would be a bigger problem, and using SIMD would just make you hit that bottleneck sooner. – Vilx- Jun 01 '17 at 22:44
  • @Antimony - If you don't know what I'm talking about, this is a good article to start with: https://lwn.net/Articles/250967/ (confession: I haven't read it all myself). It's old, but I don't think it will get irrelevant anytime soon. Also - this beautiful visualization of the problem: http://www.overbyte.com.au/misc/Lesson3/CacheFun.html – Vilx- Jun 01 '17 at 22:52
  • @Antimony - To put this theory to practice: The only way you can be sure how your values are laid out in memory is if they are all in the same array. So have a large array with ALL your values, and make the dictionary contain indices into that array, rather than many small arrays. This will have dual benefits - first, you have total control over how the values are laid out in memory; second, you reduce allocations/objects (each separate array is a separate object that the GC needs to track). Then when it's time to iterate, do it over that big array rather than the dictionary (see the sketch after these comments). – Vilx- Jun 01 '17 at 22:54
  • @Antimony - I expect this approach to be noticeably faster if you have many megabytes of data. If the whole dictionary is already smaller than the L2 cache of the CPU, then there will probably be little to gain, and the process will be very fast anyway. – Vilx- Jun 01 '17 at 22:55
  • @Antimony - Also, this is totally compatible with the SIMD approach. Why not use both? :) – Vilx- Jun 01 '17 at 23:05
  • Thanks @Vilx. I shall look into this. I especially like your idea about pre-allocating a large array but in my case, it may not be possible to know about the data beforehand, and also, the size of the dictionary would keep increasing as newer data keeps streaming in. But thanks, it's a great suggestion anyway! – Antimony Jun 01 '17 at 23:44
  • Well, you can also use a `List` which uses an array internally and resizes it as needed. Or maybe some hybrid scheme where you have a list of large fixed-size arrays. – Vilx- Jun 02 '17 at 06:24
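
To make the flat-array suggestion from these comments concrete, here is a minimal sketch. The `rowIndex` dictionary, the row count, and the use of `System.Numerics.Vector<double>` are illustrative assumptions, not code from the thread; `Vector<T>` needs .NET 4.6+ (or the System.Numerics.Vectors package) and only maps to SIMD instructions when `Vector.IsHardwareAccelerated` is true:

using System.Collections.Concurrent;
using System.Numerics;

// All rows live in one big row-major array; the dictionary only maps a
// key to its row number, so both passes walk memory sequentially.
const int cols = 5;                 // fixed array size from the question
var rows = 1000000;                 // illustrative
var data = new double[rows * cols];
var rowIndex = new ConcurrentDictionary<int, int>();  // key -> row number
// ... fill data and rowIndex as items arrive ...

// Pass 1: per-position maxima, walking the flat array front to back.
var max = new double[cols];
for (var j = 0; j < cols; j++)
    max[j] = double.MinValue;
for (var r = 0; r < rows; r++)
    for (var j = 0; j < cols; j++)
        if (data[r * cols + j] > max[j])
            max[j] = data[r * cols + j];

// Pass 2: normalize each row, using SIMD for as much of the row as fits
// into a Vector<double> and scalar code for the remainder.
var width = Vector<double>.Count;
for (var r = 0; r < rows; r++)
{
    var start = r * cols;
    var j = 0;
    for (; j <= cols - width; j += width)
    {
        var v = new Vector<double>(data, start + j) / new Vector<double>(max, j);
        v.CopyTo(data, start + j);
    }
    for (; j < cols; j++)
        data[start + j] /= max[j];
}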

LINQ is designed for querying data, not modifying data. You can use a little LINQ to compute the maximums, but that is about it:

// Compute the maximum of each column with a little LINQ.
var cols = ItemFeatures.First().Value.Length;
var maxv = new double[cols];
for (var j1 = 0; j1 < cols; ++j1)
    maxv[j1] = ItemFeatures.Values.Select(vs => vs[j1]).Max();

// Then divide every element in place by its column's maximum.
foreach (var kvp in ItemFeatures)
    for (var j1 = 0; j1 < cols; ++j1)
        kvp.Value[j1] /= maxv[j1];
NetMage
  • The last loop can be eliminated too: `ItemFeatures = ItemFeatures.ToDictionary(x => x.Key, x => x.Value.Zip(maxv, (a, b) => a / b).ToArray());` The only problem is that the result of `ToDictionary()` can't be assigned back to the `ConcurrentDictionary` variable (see the sketch after these comments). :( – Antimony Jun 01 '17 at 21:23
  • That would seem to be a flaw. Also, using `ToDictionary` and `ToArray` isn't going to be very efficient anyway. And `Zip` seems like overkill, why not just `Values.Select(a => a / maxv)` ? – NetMage Jun 01 '17 at 21:34
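
For completeness, a hedged sketch of one way around that, using the `ConcurrentDictionary<TKey, TValue>` constructor that accepts an `IEnumerable<KeyValuePair<TKey, TValue>>`. It rebuilds the dictionary and allocates a new array per row, so, as the reply above notes, the in-place loop in the answer is still the cheaper option:

// Assumes the maxv array computed in the answer above, plus
// using System.Linq; using System.Collections.Generic; using System.Collections.Concurrent;
ItemFeatures = new ConcurrentDictionary<int, double[]>(
    ItemFeatures.Select(kvp => new KeyValuePair<int, double[]>(
        kvp.Key,
        kvp.Value.Zip(maxv, (a, b) => a / b).ToArray())));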