0

I am iterating over flatProduct.Catalogs slice and populating my productCatalog concurrent map in golang. I am using upsert method so that I can add only unique productID's into my productCatalog map.

This uses a linear scan to check for duplicate product IDs but in my case I have more than 700k productId's so it is very slow for me. I am looking for ways to make it more efficient.

Below code is called by multiple goroutines in parallel that is why I am using concurrent map here to populate data into it.

var productRows []ClientProduct
err = json.Unmarshal(byteSlice, &productRows)
if err != nil {
    return err
}
for i := range productRows {
    flatProduct, err := r.Convert(spn, productRows[i])
    if err != nil {
        return err
    }
    if flatProduct.StatusCode == definitions.DONE {
        continue
    }
    r.products.Set(strconv.Itoa(flatProduct.ProductId, 10), flatProduct)
    for _, catalogId := range flatProduct.Catalogs {
        catalogValue := strconv.FormatInt(int64(catalogId), 10)
        // how can I improve below Upsert code for `productCatalog` map so that it can runs faster for me?
        r.productCatalog.Upsert(catalogValue, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
            productID := newValue.(int64)
            if valueInMap == nil {
                return []int64{productID}
            }
            oldIDs := valueInMap.([]int64)

            for _, id := range oldIDs {
                if id == productID {
                    // Already exists, don't add duplicates.
                    return oldIDs
                }
            }
            return append(oldIDs, productID)
        })
    }
}

Above upsert code is very slow for me and it takes lot of time to add unique product id's as a value in my concurrent map. Here is how productCatalog is defined.

productCatalog *cmap.ConcurrentMap

Here is the upsert method which I am using - https://github.com/orcaman/concurrent-map/blob/master/concurrent_map.go#L56

This is how I am reading data from this cmap:

catalogProductMap := clientRepo.GetProductCatalogMap()
productIds, ok := catalogProductMap.Get("200")
var data = productIds.([]int64)
for _, pid := range data {
  ...
}
rosed
  • 137
  • 8
  • You can always use another `map[int64]struct{}` and recreate slice from keys at the end. – medasx Mar 21 '22 at 18:50
  • Nested map will add lot of allocation causing lot of memory usage right? Is there any other option or this is the only option I have – rosed Mar 21 '22 at 19:48
  • It wouldn't allocate lot of memory with `struct{}` size of it is 0. I tried also with binary search on slice, but you would have to ensure that IDs are sorted. But still map should be the best choice. – medasx Mar 22 '22 at 08:07

1 Answers1

1

To summarize answers from comments:

The upsert function is O(n**2) where n is the length of the slice.

The problem as you also mentioned is iterating through whole slice to find duplicate. This can be avoided using another map.

Example:

r.productCatalog.Upsert(catalogValue, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
    productID := newValue.(int64)
    if valueInMap == nil {
        return map[int64]struct{}{productID: {}}
    }
    oldIDs := valueInMap.(map[int64]struct{})
    
    // value is irrelevant, no need to check if key exists 
    oldIDs[productID] = struct{}{}
    return oldIDs
})

Nested map will add lot of allocation causing lot of memory usage right?

Nope, using empty struct won't create new allocations or increase memory usage. You can find plenty of articles/questions about empty struct and its usage. (e. g. What uses a type with empty struct has in Go?)

Note: you could use some kind of optimised search for array like binary search used by sort.Search, but it requires sorted array.

medasx
  • 634
  • 4
  • 7
  • Got it now so in this case what will be the types of key and value of `productCatalog` map? In general what I had earlier was key was string and values were int64 `productId's` so if it stays the same then my code to grab data out of it will stay same so just wanted to confirm. – rosed Mar 22 '22 at 16:04
  • concurrent map is type `map[string]interface{}`, your usage was `map[catalogValue][]int64{productID_1, productID_2}` and with map you will have `map[catalogValue]map[int64]struct{}{ productID_1: {}, productID_2: {}}`. – medasx Mar 22 '22 at 18:48
  • Not sure how are you getting data from it but you gotta iterate through map keys to get list of IDs. `for productID, _ := range r.productCatalog.Get(someKey) {}` – medasx Mar 22 '22 at 18:51
  • I updated my question to add details on how I am reading the data from that cmap currently. So that stays same with this new approach too right? If not, can you help me understand on how that will change with this new design to read it now? – rosed Mar 22 '22 at 19:51
  • Output of this `r.productCatalog.Get(someKey)` with your suggestion is like this `map[806:{} 911:{} 1851:{}]` so I am not sure the for loop you suggested will work that way right? We need to make some changes to it too? – rosed Mar 22 '22 at 22:25
  • 1
    You should check language basics it could help you in future projects. Anyways you can [range](https://gobyexample.com/range) through map keys `for key,value := range map` so in your case `data, ok := productIds.(map[int64]struct{})` and simply range through map`for pid, _ := range data{})` – medasx Mar 23 '22 at 09:33
  • I am seeing a race condition because of `oldIDs[productID]` map. I opened a new question [here](https://stackoverflow.com/questions/75025126/race-condition-while-writing-and-reading-from-the-map). I wanted to see if you can help me out. – rosed Jan 05 '23 at 23:57