Efficiently replace a portion of a binary object

Question

My goal is to efficiently overwrite several sections of a blob, while leaving its overall content and structure pristine. To this end, I have written the following functions:

replace :: ByteString -> Int -> ByteString -> ByteString
replace source offset replacement = prefix <> replacement <> suffix
  where
    prefix = ByteString.take offset source 
    suffix = ByteString.drop (offset + ByteString.length replacement) source

fix :: Int -> ByteString -> ByteString -> ByteString
fix offset replacement source = replace source offset replacement

And this is how they work:

λ replace "I have a cat." 0x09 "dog"
"I have a dog."
λ fix 0x0A "fat" . fix 0x03 "dog" $ "My cat is red."
"My dog is fat."

Unfortunately, <> = append is linear time, and therefore these functions too. Having the knowledge that the replacement is just as long as the section being replaced, surely we can do better?

I would be especially joyful if it happens that we can replace multiple sections in time less than linear over the number of replacements.

By linear you mean linear proportional to the haystack rather than the needle, right? Well, as long as you produce a copy with some bytes modified, you can't avoid that. If you want destructive, in-place substitution, you may want the ST monad. — sshine, Apr 25 '18 at 08:32
@SimonShine I was referring to the number of replacements. I will edit the post to indicate that. — Ignat Insarov, Apr 25 '18 at 08:57
@SimonShine Would be great if you could write an answer that shows how to perform that with ST monad. As I understand, we will have to regard a ByteString as an array, and I am unsure if it is morally correct since the underlying representation may change, and whether it is reasonable to use a plain array here. — Ignat Insarov, Apr 25 '18 at 09:01
I might later today; doing some quick research, I'd either try combining [ST](https://stackoverflow.com/questions/34494893/how-to-understand-the-state-type-in-haskells-st-monad) and [STArray from Data.Array.ST](https://wiki.haskell.org/Arrays#Mutable_arrays_in_ST_monad_.28module_Data.Array.ST.29), but going between ByteString and STArray has a cost. Alternatively, [using Data.Conduit.Binary](https://www.reddit.com/r/haskell/comments/29nvsx/how_to_get_good_performance_when_processing/) seems to let you do it directly on ByteStrings. I've never tried something like this, though. — sshine, Apr 25 '18 at 09:21
@SimonShine Thank you, Simon. I will be looking into these links now. I think a constant comlexity isomorphism between a strict byte string and an ST array would solve the problem, but I am not sure how to go about obtaining one. — Ignat Insarov, Apr 25 '18 at 09:35
If you are doing a lot of replacements, consider converting to a lazy bytestring (for which concat is constant time). For absolutely optimal performance, your best bet is to copy the bytestring, then do a mutable replacement on the underlying ForeignPtr of the copy. — user2407038, Apr 25 '18 at 12:19
@user2407038 And how does one do that? And why is copying unavoidable? (If I promise that no other piece of logic cares about a given byte string, could I change it in place?) Perchance you could be bothered to summarize this all in a proper answer? — Ignat Insarov, Apr 25 '18 at 12:33
@Ignat Insarov How does your data come into your program? I have a strong feeling that making just a single copy of the array in order to make a series of mutations is *not* going to be your rate-limiting step, especially if your data came in from the disk, the network, or spent any time as the type `[Word8]`. — hegel5000, Apr 25 '18 at 13:05
@hegel5000 I'm reading it from a file with `ByteString.readFile`, so, yes, there is no problem at all in making a copy. I'm just being a perfectionist for the cause of understanding the case better. — Ignat Insarov, Apr 25 '18 at 13:23
I've gone from writing database type stuff in Haskell to doing scientific computing in C++. C++ doesn't have GC which means it can't do persistent data structures (e.g. `[]` or `Data.Map`) without a lot of setup. The result is that there is a _lot_ of copying, even if you aren't going for pure functional programming (which I do). But now I profile my programs a lot and I've found that most of that copying just doesn't show up, and if it does then it's not the O(n) copying step per se that's the problem but the O(1) allocation/deallocation steps. — hegel5000, Apr 25 '18 at 13:42

Regis Kuckaertz · Accepted Answer · 2018-04-25T16:43:32.387

2

Given concat is linear in the size of the list and both take and drop are O(1), this may be as good as it gets:

replace source offset replacement = 
  concat [ take offset source
         , replacement
         , drop n source
         ] 
  where n = offset + length replacement

For an arbitrary number of replacements:

replacemany :: [(Int, ByteString)] -> ByteString -> ByteString
replacemany wss bs = foldr f bs wss where
  f (l, ws) bs = replace bs l ws

edited Apr 25 '18 at 16:43

answered Apr 25 '18 at 11:51

Regis Kuckaertz

991
5
14

Well, this is definitely very readable and nice. But I still hope there is a better solution performance wise. – Ignat Insarov Apr 25 '18 at 12:36
Also, I wonder how you would extend this to work with multiple offset-replacement pairs. – Ignat Insarov Apr 25 '18 at 12:42
1

With a fold of course :-) Well if I'm correct, if you take non-strictness into account this is amortised O(1). – Regis Kuckaertz Apr 25 '18 at 12:54
I'm not sure I see how it should go. Do you mind editing your thoughts on this into the answer? Especially on the amortized O(1) part. – Ignat Insarov Apr 25 '18 at 13:25
I thought the ByteString type worked somehow like a list, in that you would only pay the cost of concat when you actually consume the list, but then I looked into the source code and it’s using memcpy so now I’m not so sure. Anyway, just added some code, I hope it’s helpful. – Regis Kuckaertz Apr 25 '18 at 16:46
I don't see how you would go about doing n operations in one go, other than using SIMD instructions – Regis Kuckaertz Apr 25 '18 at 17:13

Efficiently replace a portion of a binary object

1 Answers1