While writing a parser, I ran into a situation where two string slices borrow from the same original string and are adjacent in memory. I could copy both slices and concatenate them into a new string, but that wastes time and memory. Is there a clean way to merge them in Rust without unsafe code? To illustrate, here is how I would like to solve it:

fn main() {

     //This is the owned string. 
     //(Of course, this is also just a slice of a static string, but that makes no difference here).
     let origin: &str = "Hello World";

     //Substrings which borrow data from the original and should be adjacent in memory
     let a: &str = &origin[0..5];
     let b: &str = &origin[5..11];

     //If the representation of a and b on the stack is:
     //      a: { ptr: PointerA, len: LenA }
     //      b: { ptr: PointerB, len: LenB }
     //Then PointerA + LenA should be PointerB
     //From this I conclude that there must be a way to combine these strings into c,
     //which in turn would have this representation on the stack:
     //      c: { ptr: PointerA, len: LenA + LenB }
     //The merge method doesn't actually exist; it's just an example of how I would imagine the API to look.
     let c = a.merge(b).unwrap();

     assert!(c == origin)
}

Contrast this with my current, less efficient solution:

fn main() {

     let origin: &str = "Hello World";

     let a: &str = &origin[0..5];
     let b: &str = &origin[5..11];

     //Here both strings are copied into a new heap allocation,
     //which wastes memory because the copied data exactly matches the data in origin
     let c = a.to_owned() + b;

     assert!(c == origin)
}

EDIT: This is an example of how I would implement this with unsafe code, but I really don't know whether it is actually sound:

fn main() {

 let origin: &str = "Hello World";

 let a: &str = &origin[0..5];
 let b: &str = &origin[5..11];

 let c = merge(a, b).unwrap();

 assert!(c == origin)
}

fn merge<'a>(one: &'a str, two: &'a str) -> Option<&'a str> {
     unsafe {
         let one: [usize; 2] = std::intrinsics::transmute(one);
         let two: [usize; 2] = std::intrinsics::transmute(two);
         if let Some(len) = one[1].checked_add(two[1]) {
             if one[0] + one[1] == two[0] {
                 Some(std::intrinsics::transmute([one[0], len]))
             } else {
                 None
             }
         } else {
             None
         }
    }
}
Goldenprime
    Regarding your unsafe code: you don't have to use transmutes here. Check out `str::as_mut_ptr` and `str::from_raw_parts` for a better (but still unsafe) way. – Dogbert Jul 04 '22 at 16:17

1 Answer

You can chain the characters of the two split strings:

fn main() {

     //This is the owned string. 
     //(Of course, this is also just a slice of a static string, but that makes no difference here).
     let origin: &str = "Hello World";

     //Substrings which borrow data from the original and should be adjacent in memory
     let a: &str = &origin[0..5];
     let b: &str = &origin[5..11];

     let c = a.chars().chain(b.chars());
    
     assert_eq!(c.collect::<String>().as_str(), origin);
}

Note, however, that most operations requiring a &str would force you to collect into a new string anyway.
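Where the consumer can work on an iterator instead of a &str, the chained iterator can be used directly and nothing is copied. A minimal sketch follows; the helper name `chained_eq` is hypothetical and only serves as an illustration:

```rust
/// Hypothetical helper: checks whether `a` followed by `b` spells `whole`,
/// without building an intermediate String.
fn chained_eq(a: &str, b: &str, whole: &str) -> bool {
    a.chars().chain(b.chars()).eq(whole.chars())
}

fn main() {
    let origin: &str = "Hello World";
    let a: &str = &origin[0..5];
    let b: &str = &origin[5..11];

    // The comparison walks both iterators lazily; no allocation happens.
    assert!(chained_eq(a, b, origin));

    // Other iterator consumers work on the chained iterator as well.
    let spaces = a.chars().chain(b.chars()).filter(|c| *c == ' ').count();
    assert_eq!(spaces, 1);
}
```

This only helps when the downstream code accepts an iterator of chars; an API that takes a &str still needs one contiguous string.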

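That brings us to unsafe code. As pointed out by @chayimfriedman, the following is UB: the pointer taken from a is only valid for the bytes of a, so building a slice of a.len() + b.len() bytes from it reaches outside that range, even when the two slices are adjacent: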

use std::{slice, str};

fn main() {
    //This is the owned string.
    //(Of course, this is also just a slice of a static string, but that makes no difference here).
    let origin: &str = "Hello World";

    //Substrings which borrow data from the original and should be adjacent in memory
    let a: &str = &origin[0..5];
    let b: &str = &origin[5..11];

    let c: &str = merge(a, b).unwrap();

    assert_eq!(c, origin);
}

fn merge<'a>(a: &'a str, b: &'a str) -> Result<&'a str, String> {
    let a_len = a.len();
    let a_ptr = a.as_ptr();
    let b_ptr = b.as_ptr();
    let b_len = b.len();

    if a_ptr as usize + a_len != b_ptr as usize {
        return Err("Strings are not adjacent".to_string());
    }

    Ok(unsafe {
        str::from_utf8_unchecked(slice::from_raw_parts(a_ptr as *const u8, a_len + b_len))
    })
}

Note that you could eventually avoid the pointer-to-integer casts and use ptr::addr instead, which is still nightly-only as of Rust 1.62.
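If the original string is still reachable at the point where the merge is needed, one way to stay entirely in safe code is to re-slice the original rather than build a new slice from raw parts. The sketch below assumes that both substrings were taken from `origin`; the function name `merge_within` is hypothetical and not part of the standard library:

```rust
/// Hypothetical helper: returns the part of `origin` that covers `a`
/// immediately followed by `b`, or None if they are not adjacent
/// substrings of `origin` in that order.
fn merge_within<'a>(origin: &'a str, a: &str, b: &str) -> Option<&'a str> {
    // The addresses are only compared as integers; nothing is dereferenced.
    let base = origin.as_ptr() as usize;
    let a_start = (a.as_ptr() as usize).checked_sub(base)?;
    let b_start = (b.as_ptr() as usize).checked_sub(base)?;
    let a_end = a_start.checked_add(a.len())?;
    let b_end = b_start.checked_add(b.len())?;

    if a_end != b_start || b_end > origin.len() {
        return None;
    }
    // `get` re-checks bounds and char boundaries, so this cannot panic.
    origin.get(a_start..b_end)
}

fn main() {
    let origin: &str = "Hello World";
    let a: &str = &origin[0..5];
    let b: &str = &origin[5..11];

    let c = merge_within(origin, a, b).unwrap();
    assert_eq!(c, origin);

    // Slices with a gap between them are rejected.
    assert!(merge_within(origin, &origin[0..5], &origin[6..11]).is_none());
}
```

The returned slice borrows from `origin`, so it is derived from a reference that is valid for the whole range, which is what the raw-pointer version lacks. The trade-off is that the caller has to keep `origin` around and pass it in.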

Netwave
  • Are you sure that this is really faster than the second method that is already in my question? It looks like both strings are copied. – Goldenprime Jul 04 '22 at 15:55
  • @Goldenprime, indeed they are copied; that is why I remarked that for most operations you would need to create a new string anyway. – Netwave Jul 04 '22 at 15:57
  • Also, it should be slightly faster to use iterators in this case than to call to_owned and then append. The point of iterators is that you can avoid copying if you can work with what you already have. Otherwise you **have to** copy. – Netwave Jul 04 '22 at 16:00
  • Isn't this chars() iterator also an immutable borrow of the string slice, so it can't change the underlying memory and everything has to be copied anyway? Also, please take a look at the code I added to the question. That's more or less what I'm looking for in a standard or third-party library method, because I really don't feel comfortable writing unsafe code. – Goldenprime Jul 04 '22 at 16:07
  • @Goldenprime, you cannot avoid the unsafe code I think. But you can actually make it better. – Netwave Jul 04 '22 at 16:12
  • **YOUR UNSAFE CODE IS UB** (as well as the OP's). See my answer to the duplicate I linked. – Chayim Friedman Jul 04 '22 at 21:16
  • @ChayimFriedman good point thanks! Btw, it would have been enough to bold your message, no need to caps it, it is like screaming angry when reading it ;) – Netwave Jul 05 '22 at 06:59