5

I'm trying to extend bluss's rust-itertools with SQL-like join iterators. I encountered a particular problem with RIGHT OUTER JOIN using a hash join strategy (the strategy itself is actually very simple).

The iterator adaptor struct takes 2 input iterators of which the second (the right) is loaded into the HashMap. The iteration works as follows:

  1. The item from the left iterator is matched against the map - in case of a match return both items
  2. When the left iterator is exhausted, return the non-matched values from the map

The problem is with the second part where I tried to store the map's Values iterator along with the map to store its iteration state. But as I learned in this answer, it's not possible in rust. Unfortunately I have no idea how it could be done otherwise.

Here is the complete code for the INNER JOIN adaptor, which does the first part:

use std::collections::HashMap;
use std::hash::Hash;

pub struct HashJoinInner<I, K, V0, V1> where
    I: Iterator<Item=(K, V0)>,
    K: Hash + Eq,
    V1: Clone,
{
    left: I,
    right: HashMap<K, V1>,
}

impl<I, K, V0, V1> HashJoinInner<I, K, V0, V1> where
    I: Iterator<Item=(K, V0)>,
    K: Hash + Eq,
    V1: Clone,
{
    /// Create a `HashJoinInner` iterator.
    pub fn new<J>(l: I, r: J) -> Self
        where J: Iterator<Item=(K, V1)>
    {
        let mut hm: HashMap<K, V1> = HashMap::new();
        for (k, v) in r {
            hm.insert(k, v);
        }
        HashJoinInner {
            left: l,
            right: hm,
        }
    }
}

impl<I, K, V0, V1> Iterator for HashJoinInner<I, K, V0, V1> where
    I: Iterator<Item=(K, V0)>,
    K: Hash + Eq,
    V1: Clone,
{
    type Item = (V0, V1);

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.left.next() {
                Some((k0, v0)) => match self.right.get(&k0) {
                    Some(v1) => return Some((v0, Clone::clone(v1))),
                    None => continue,
                },
                None => return None,
            }
        }
    }
}

I'll be grateful for any idea.

Community
  • 1
  • 1
milancio
  • 91
  • 5
  • I apologize, these are my first steps on Stack Overflow and I did not realized I could have add an answer to my own question. Thanks again for your help - your idea helped me to find the solution (posted below). – milancio Aug 17 '15 at 09:26

2 Answers2

1

You cannot store the Values iterator because it contains references to the HashMap. These references could become invalid if you move the map. However, you can consume the HashMap using the into_iter method. That owns all the values of the HashMap and can be moved into a new struct.

Here's a tweaking of your code from the earlier question. This isn't yet a left or right join. There's complexity about the switch from being done with one iterator to finishing off the other iterator.

use std::collections::hash_map::{HashMap, IntoIter};
use std::hash::Hash;

struct Foo<K, V>
    where K: Hash + Eq,
          V: Clone,
{
    iter: IntoIter<K, (V, bool)>,
}

impl<K, V> Foo<K, V>
    where K: Hash + Eq,
          V: Clone,
{
    fn new<I>(it: I) -> Self
        where I: Iterator<Item=(K, V)>
    {
        let mut map = HashMap::new();
        for (k, v) in it {
            map.insert(k, (v, false));
        }
        Foo { iter: map.into_iter() }
    }
}

impl<K, V> Iterator for Foo<K, V>
    where K: Hash + Eq,
          V: Clone,
{
    type Item = V;
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.iter.next() {
                Some((_, (v, false))) => return Some(v.clone()),
                Some(_) => continue,
                None => return None,
            }
        }
    }
}

fn main() {
    let it = (0..).zip("AB".chars());
    let foo = Foo::new(it);
    for v in foo {
        println!("{}", v);
    }
}

However you don't need to do any of that to get what you want. You can simply create a hashmap and check it as you iterate over the other item. I accidentally created a left outer join, but just flip the arguments to get a right outer join. ^_^

use std::collections::hash_map::HashMap;
use std::hash::Hash;

struct LeftOuterJoin<L, K, RV> {
    left: L,
    right: HashMap<K, RV>,
}

impl<L, K, RV> LeftOuterJoin<L, K, RV> 
    where K: Hash + Eq
{
    fn new<LI, RI>(left: LI, right: RI) -> Self
        where L: Iterator<Item=LI::Item>,
              LI: IntoIterator<IntoIter=L>,
              RI: IntoIterator<Item=(K, RV)>
    {
        LeftOuterJoin {
            left: left.into_iter(),
            right: right.into_iter().collect()
        }
    }
}

impl<L, K, LV, RV> Iterator for LeftOuterJoin<L, K, RV>
    where L: Iterator<Item=(K, LV)>,
          K: Hash + Eq,
          RV: Clone
{
    type Item = (K, LV, Option<RV>);

    fn next(&mut self) -> Option<Self::Item> {
        match self.left.next() {
            Some((k, lv)) => {
                let rv = self.right.get(&k);
                Some((k, lv, rv.cloned()))
            },
            None => None,
        }
    }
}

fn main() {
    let mut left = HashMap::new();
    left.insert(1, "Alice");
    left.insert(2, "Bob");

    let mut right = HashMap::new();
    right.insert(1, "Programmer");

    for (id, name, job) in LeftOuterJoin::new(left.into_iter(), right) {
        println!("{} ({}) is a {:?}", name, id, job);
    }
}
Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
  • Interesting, it could be working. The only problem left is that when I consume the map at the beginning I won't be able to test if the left iterator keys are present in the map. However, using an Option<> with some restructuring of the code might work. I'll try to play with it and I'll update the question in case it work. Many thanks for your time – milancio Aug 14 '15 at 14:38
  • @milancio I've updated with a working left outer join – Shepmaster Aug 14 '15 at 14:46
  • Unfortunately hash join strategy is not symmetric, so flipping the input won't do. The left iterator is not guaranteed to be collectable in RAM, only the right one. It is this what it makes so efficient, because the left iter can be arbitrarily large and do not need to be sorted. I've already finished [merge join](https://github.com/milancio42/rust-itertools/blob/merge-join/src/merge_join.rs) join if you're interested ;) – milancio Aug 14 '15 at 15:29
1

Thanks to Shepmaster's idea of using std::collections::hash_map::IntoIter I've managed to solve the problem.

Here is the complete solution for RIGHT OUTER JOIN using the hash join strategy:

use std::collections::hash_map::{HashMap, IntoIter,};
use std::mem;
use std::hash::Hash;

#[must_use = "iterator adaptors are lazy and do nothing unless consumed"]
pub struct HashJoinRightOuter<L, K, RV> {
    left: L,
    map: HashMap<K, (RV, bool)>,
    /// exclusion iterator - yields the unmatched values from the map. It is created once the left
    /// iterator is exhausted
    excl_iter: Option<IntoIter<K, (RV, bool)>>,
}

impl<L, K, RV> HashJoinRightOuter<L, K, RV> 
    where K: Hash + Eq,
{
    /// Create a `HashJoinRightOuter` iterator.
    pub fn new<LI, RI>(left: LI, right: RI) -> Self
        where L: Iterator<Item=LI::Item>,
              LI: IntoIterator<IntoIter=L>,
              RI: IntoIterator<Item=(K, RV)>
    {
        let mut map: HashMap<K, (RV, bool)> = HashMap::new();
        for (k, v) in right.into_iter() {
            map.insert(k, (v, false));
        }
        HashJoinRightOuter {
            left: left.into_iter(),
            map: map,
            excl_iter: None,
        }
    }

    /// Moves the map to `self.excl_iter`
    ///
    /// Once the left iterator is exhausted, the info about which keys were matched is complete.
    /// To be able to iterate over map's values we need to move it into its `IntoIter`.
    fn set_excl_iter(&mut self) {
        let map = mem::replace(&mut self.map, HashMap::<K, (RV, bool)>::new());
        self.excl_iter = Some(map.into_iter());
    }
}

impl<L, K, LV, RV> Iterator for HashJoinRightOuter<L, K, RV> 
    where L: Iterator<Item=(K, LV)>,
          K: Hash + Eq,
          RV: Clone,
{
    type Item = (Option<LV>, RV);

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.excl_iter {
                // the left iterator is not yet exhausted
                None => match self.left.next() {
                    Some((lk, lv)) => match self.map.get_mut(&lk) {
                        Some(rt) => {
                            rt.1 = true; // flag as matched
                            return Some((Some(lv), Clone::clone(&rt.0)))
                        },
                        None => continue, // not interested in unmatched left value
                    },
                    // the left iterator is exhausted so move the map into `self.excl_iter`.
                    None => self.set_excl_iter(),
                },
                // iterate over unmatched values
                Some(ref mut r) => match r.next() {
                    Some((_, (rv, matched))) => {
                        if !matched {
                            return Some((None, rv));
                        } else {
                            continue;
                        }
                    },
                    None => return None,
                }
            }
        }
    }
}

fn main() {
    let a = (0..).zip("AB".chars());
    let b = (1..).zip("XY".chars());
    let mut it = HashJoinRightOuter::new(a, b);
    assert_eq!(it.next(), Some((Some('B'), 'X')));
    assert_eq!(it.next(), Some((None, 'Y')));
    assert_eq!(it.next(), None);
}

At the beginning I failed because I tried to store both the data and it's reference in the same struct, which has no meaning anyway. What I really wanted was to store the data first, do some magic with it and once done, move it into another field to work with its transformation.

This can be used to solve other self-referencing struct problems as well.

milancio
  • 91
  • 5