I have a vector of objects (the objects are term nodes that, among other fields, contain a string field with the term string):

class TermNode {
private:
    std::wstring term;
    double weight;
    ...
public:
    ...
};

After some processing and calculating of the scores, these objects finally get stored in a vector of TermNode pointers, such as

std::vector<TermNode *> termlist;

A resulting list in this vector, containing up to 400 entries, looks like this:

DEBUG: 'knowledge' term weight=13.5921
DEBUG: 'discovery' term weight=12.3437
DEBUG: 'applications' term weight=11.9476
DEBUG: 'process' term weight=11.4553
DEBUG: 'knowledge discovery' term weight=11.4509
DEBUG: 'information' term weight=10.952
DEBUG: 'techniques' term weight=10.4139
DEBUG: 'web' term weight=10.3733
...

What I am trying to do is to clean up that final list, removing single terms that are also contained as substrings in phrases inside the terms list. For example, looking at the above list snippet, there is the phrase 'knowledge discovery', and therefore I would like to remove the single terms 'knowledge' and 'discovery', because they are also in the list and redundant in this context. I want to keep the phrases containing the single terms. I am also thinking about removing all strings of 3 characters or fewer, but that is just a thought for now.

For this cleanup process I would like to code a class using remove_if / find_if (using the new C++ lambdas), and it would be nice to have that code in a compact class.

I am not really sure how to solve this. The problem is that I would first have to identify which strings to remove, probably by setting a flag as a delete marker. That would mean I would have to pre-process the list: I would have to find the single terms and the phrases that contain one of those single terms. I think that is not an easy task to do and would need some advanced algorithm. Maybe a suffix tree to identify substrings?

Another loop over the vector, and maybe a copy of the same vector, could do the cleanup. I am looking for whatever is most time-efficient.

I have been playing with ideas in the direction shown in std::list erase incompatible iterator using remove_if / find_if, and with the idea used in Erasing multiple objects from a std::vector?.

So the question basically is: is there a smart way to do this while avoiding multiple loops, and how could I identify the single terms for deletion? Maybe I am really missing something, but probably someone out there can give me a good hint.

Thanks for your thoughts!

Update

I implemented the removal of redundant single terms the way Scrubbins recommended, as follows:

/**
 * Functor gets the term of each TermNode object, looks whether the term
 * string contains spaces (i.e. the term is a phrase), splits the phrase by
 * spaces and finally stores the term tokens into a set. Only terms with a
 * weight higher than 'skipAtWeight' are taken into account.
 */
struct findPhrasesAndSplitIntoTokens {
private:
    set<wstring> tokens;
    double skipAtWeight;

public:
    findPhrasesAndSplitIntoTokens(const double skipAtWeight)
    : skipAtWeight(skipAtWeight) {
    }

    /**
     * Implements operator()
     */
    void operator()(const TermNode * tn) {
        // --- skip all terms weighted lower than skipAtWeight
        if (tn->getWeight() < skipAtWeight)
            return;

        // --- get term
        wstring term = tn->getTerm();
        // --- check for a space: if the term contains one, it is a phrase
        // --- (checking once avoids re-tokenizing the phrase for every space)
        if (term.find(L' ') != wstring::npos) {
#if 0
            wcout << "input term=" << term << endl;
#endif
            // --- simply tokenize term by whitespace and store the tokens
            // --- into the tokens set
            // --- TODO: check if this really is UTF-8 aware, esp. for
            // --- strings containing umlauts, etc  !!
            wistringstream iss(term);
            copy(istream_iterator<wstring,
                    wchar_t, std::char_traits<wchar_t> >(iss),
                istream_iterator<wstring,
                    wchar_t, std::char_traits<wchar_t> >(),
                inserter(tokens, tokens.begin()));
#if 0
            wcout << "size of token set=" << tokens.size() << endl;
            for_each(tokens.begin(), tokens.end(), printSingleToken());
#endif
        }
    }

    /**
     * return set of extracted tokens
     */
    set<wstring> getTokens() const {
        return tokens;
    }
};

/**
 * Functor to find terms in tokens set
 */
class removeTermIfInPhraseTokensSet {
private:
    set<wstring> tokens;

public:
    removeTermIfInPhraseTokensSet(const set<wstring>& termTokens)
    : tokens(termTokens) {
    }

    /**
     * Implements operator()
     */
    bool operator()(const TermNode * tn) const {
        return tokens.find(tn->getTerm()) != tokens.end();
    }
};

...

findPhrasesAndSplitIntoTokens objPhraseTokens(6.5);
objPhraseTokens = std::for_each(
    termList.begin(), termList.end(), objPhraseTokens);
set<wstring> tokens = objPhraseTokens.getTokens();
wcout << "size of tokens set=" << tokens.size() << endl;
for_each(tokens.begin(), tokens.end(), printSingleToken());

// --- remove all extracted single tokens from the final terms list
// --- of similar search terms 
removeTermIfInPhraseTokensSet removeTermIfFound(tokens);
termList.erase(
    remove_if(
        termList.begin(), termList.end(), removeTermIfFound),
    termList.end()
);

for (vector<TermNode *>::const_iterator tl_iter = termList.begin();
      tl_iter != termList.end(); ++tl_iter) {
    wcout << "DEBUG: '" << (*tl_iter)->getTerm() << "' term weight="
          << (*tl_iter)->getNormalizedWeight() << endl;
    if ((*tl_iter)->getNormalizedWeight() <= 6.5) break;
}

...

I couldn't use the C++11 lambda syntax, because my Ubuntu servers currently have g++ 4.4.1 installed. Anyway, it does the job for now. The way to go is to check the quality of the resulting weighted terms against other search result sets, see how I can improve that quality, and find a way to boost the more relevant terms in conjunction with the original query term. That might not be an easy task; I wish there were some "simple heuristics". But that might be another new question once I have stepped a little further :-)
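
For reference, here is a sketch of what the same two steps could look like with the C++11 lambda syntax on a newer compiler (e.g. g++ >= 4.5 with -std=c++0x). It is untested, the helper name cleanupTermList is made up, and it reuses the termList, TermNode and 6.5 cutoff from above:

#include <algorithm>
#include <set>
#include <sstream>
#include <string>
#include <vector>

void cleanupTermList(std::vector<TermNode *>& termList)
{
    std::set<std::wstring> tokens;

    // --- collect the single words of every phrase above the cutoff
    std::for_each(termList.begin(), termList.end(),
        [&tokens](const TermNode * tn) {
            if (tn->getWeight() < 6.5)
                return;
            const std::wstring term = tn->getTerm();
            if (term.find(L' ') == std::wstring::npos)
                return;                      // single word, not a phrase
            std::wistringstream iss(term);
            std::wstring tok;
            while (iss >> tok)
                tokens.insert(tok);
        });

    // --- erase-remove all single terms found in the tokens set
    termList.erase(
        std::remove_if(termList.begin(), termList.end(),
            [&tokens](const TermNode * tn) {
                return tokens.find(tn->getTerm()) != tokens.end();
            }),
        termList.end());
}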

So thanks to all for this rich contribution of thoughts!

Andreas W. Wylach
  • Seems like your question title is a little misleading... this isn't a vector issue, this is a text processing issue. Once you've identified the substrings, purging them from the vector is trivial, especially if there are only 400 entries. – Rook Jun 14 '12 at 09:18
    Why not use a `vector` of smart pointers to `TermNode` objects? That way, you can save having to `delete` the pointers returned after the `find` operations? – dirkgently Jun 14 '12 at 09:19
  • I'd be tempted to sort the vector in reverse order of term length. Iterate over the vector, split each term into individual words and drop those words into a `std::set`. If the words already existed in the set, flag the term as needing to be deleted, then worry about purging the vector. – Rook Jun 14 '12 at 09:20
  • Oh, and if you used a `std::list` instead of a vector, you get constant time inserts and deletes, and you can delete an item without invalidating your iterator. Do you really need random access to items in the vector, or do you only really need to be able to traverse it? – Rook Jun 14 '12 at 09:22
  • @Rook: Actually, in general, it is advised to use `vector` by default and only move to other containers if you have special requirements. – Matthieu M. Jun 14 '12 at 09:46
  • Why is that advised, though? In the situation where you don't need random access, you do need iteration and you'd like reasonably efficient deletes during an iteration of the collection, `std::list` is ideal. If weight order was more important than speed of deletion, a `std::set` with a custom comparer might be better. In either case, a `std::vector` is not an ideal container even if it can be made to do the job. – Rook Jun 14 '12 at 09:52
  • Sorry, if the "vector" term in the title seems to be a little misleading. – Andreas W. Wylach Jun 14 '12 at 09:55
  • Several questions have arisen: you talk about substrings, but is it about words ? (ie should `app` be removed in favor of `application`). Also, should we be talking about sets (and subsets) ? (ie should `hello world` be removed in favor of `beautiful world, hello!`). Which introduces the issue of punctuation. And the fact that you seem to be dealing with Unicode => I expect you already handle (by yourself) the unicode related issues (some unicode sequences should be considered equivalent even if they are different bitwise). – Matthieu M. Jun 14 '12 at 09:55
  • @Rook: It is advised *by default* because it is simple and efficient. **Obviously** if you have specific requirements those should help drive the choice of container (which is just repeating myself, really). – Matthieu M. Jun 14 '12 at 09:56
  • You mean like being able to efficiently and conveniently remove items from the container as you iterate over it? (which is just repeating myself, really) – Rook Jun 14 '12 at 09:58
  • Matthieu, your thought is a good point. The program I am writing is a "similar query term advisor", taking a bunch of search results, building a term vector and an inverted index, and calculating the tf/idf weights. For now I am playing with the results and they are not bad. So I am thinking of favoring the phrases over the single terms. I need to do a lot more testing and inspect the results, not just kill the single terms. It may be one way to boost other phrases like "knowledge mining" and "text mining", which are among the top weighted terms. The original query of the analyzed search results was "data mining". – Andreas W. Wylach Jun 14 '12 at 10:05
  • One more: I tokenize the search results by sentences, tokenize the sentences by whitespace, filter out stopwords, and afterwards produce ngrams (uni-, bi- and tri-grams) out of these token streams, which finally get inserted into the index. I do that process with the title and the text summary snippet. It was the only way I could think of to also gather phrases such as "knowledge mining" and the like. So punctuation is taken into account! – Andreas W. Wylach Jun 14 '12 at 10:10
  • @Rook: efficiently and conveniently removing items from a container is hard to gauge. Because of memory caching behaviors a `vector` generally outperforms a `list` for small quantities even for operations "in the middle", small being up to a few hundreds or a few thousands of items. The issue with big O complexity is that it gives the theoretical complexity at the limit, and blissfully ignores the constant factors and the fact that many people only ever deal with small collections to begin with. – Matthieu M. Jun 14 '12 at 14:10
  • Thank you, that's exactly the sort of answer I was looking for. I'll bear it in mind in future! – Rook Jun 15 '12 at 07:59

2 Answers


What you need to do is first iterate through the list and split all the multi-word values up into single words. If you're allowing Unicode, this means you will need something akin to ICU's BreakIterators; otherwise you can go with a simple punctuation/whitespace split. When each string has been split into its constituent words, use a hash map to keep a list of all the current words. When you reach a multi-word value, you can then check whether its words have already been found. This should be the simplest way to identify duplicates.
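
A minimal sketch of this approach (assuming a plain whitespace split and a std::set instead of a hash map; the helper name removeRedundantSingleTerms is made up):

#include <set>
#include <sstream>
#include <string>
#include <vector>

void removeRedundantSingleTerms(std::vector<TermNode *>& termlist)
{
    std::set<std::wstring> phraseWords;

    // pass 1: split every multi-word term and remember its words
    for (std::size_t i = 0; i < termlist.size(); ++i) {
        const std::wstring term = termlist[i]->getTerm();
        if (term.find(L' ') == std::wstring::npos)
            continue;                        // single word, nothing to split
        std::wistringstream iss(term);
        std::wstring word;
        while (iss >> word)
            phraseWords.insert(word);
    }

    // pass 2: keep only terms that are not a word of some phrase
    // (phrases contain a space and therefore never match a single word)
    std::vector<TermNode *> kept;
    for (std::size_t i = 0; i < termlist.size(); ++i) {
        if (phraseWords.count(termlist[i]->getTerm()))
            delete termlist[i];              // redundant single term
        else
            kept.push_back(termlist[i]);
    }
    termlist.swap(kept);
}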

Scrubbins
  • This is an interesting beginning, however it gets complicated when you consider the subset aspect. Specifically, I would guess that "Hello World" should be removed in favor of "Hello World I Love You". – Matthieu M. Jun 14 '12 at 09:48
  • @Matthieu: The OP did not give a clear specification in such matters, so I can't really comment. Also, I'd say that deque should be used by default. – Scrubbins Jun 14 '12 at 09:49
  • +1 for noticing (as I didn't) that although the questioner *said* "substrings", it might be more appropriate to look at complete words. A test case would be whether `"application"` should be removed in the presence of `"applications"`. Also, the point about Unicode being difficult to tokenize is vital for some uses. – Steve Jessop Jun 14 '12 at 09:51
  • It is sounding increasingly like the solution is to use some purpose built language-mangling library. Under Java there's all sorts of Lucene-related stuff, but text processing is not really the forte of C++. – Rook Jun 14 '12 at 10:01
  • Ooh, also, **"When you reach a multi-word value, then you can check if it's words have already been found"** is hazardous. __"Alice Hates Bob"__ might not want to be considered the same as __"Bob Hates Alice"__. – Rook Jun 14 '12 at 10:02
  • @Rook: Yes, I know this problem. But your example would not get removed; it would remain in the recommended query terms list. These terms / phrases are a product of the ngrams (uni-, bi- and tri-grams) I insert into the index, so both phrases would remain. It is just that I favor these phrases more than the single terms of that phrase, which are also in the index (they get inserted by the uni-gram). Anyhow, as stated in the comments on my question, I need to do more testing and check the results. In general, the "what to remove" question is no simple question. – Andreas W. Wylach Jun 14 '12 at 10:44
  • The questioner did not specify many behaviours of the algorithm. This is just a starting point. – Puppy Jun 14 '12 at 10:48
  • Once again, I want to remove *single word terms* that are contained in phrases; I do not want to remove phrases. Please see my terms output list and the example stated below that list in my question. The final ranking of these terms is weighted / ranked by the tf/idf scores; higher scored terms get to the top. @DeadMG: what exactly are you missing? I'll then try to shed some light in the darkness. – Andreas W. Wylach Jun 14 '12 at 11:02

I suggest you use the "erase-remove" idiom in this way:

struct YourConditionFunctor {
    bool operator()(TermNode* term) {
        if (/* you have to remove term */) {
           delete term;
           return true;
        }
        return false;
    }
};

and then write:

termlist.erase(
    remove_if(
        termlist.begin(),
        termlist.end(), 
        YourConditionFunctor()
    ), 
    termlist.end()
);
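
If deleting inside the predicate feels risky (see the comments below), a sketch of a variant is to stable_partition the pointers to remove to the back of the vector, delete them there, and only then erase; stable_partition keeps the relative (weight) order of the surviving terms:

// shouldKeep is a hypothetical predicate with no side effects; it returns
// true for the terms to keep (the condition itself is left open, as above)
struct shouldKeep {
    bool operator()(TermNode * term) const {
        return !(/* you have to remove term */);
    }
};

std::vector<TermNode *>::iterator doomed =
    std::stable_partition(termlist.begin(), termlist.end(), shouldKeep());

// every pointer to remove is still present in [doomed, end()), so it is
// safe to delete them here, outside of any algorithm's predicate
for (std::vector<TermNode *>::iterator it = doomed;
     it != termlist.end(); ++it)
    delete *it;

termlist.erase(doomed, termlist.end());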
Luca Martini
  • Unfortunately, this is quite the easy part, I fear. The heart of the problem (unlike what the title suggests) is identifying the terms to remove. Furthermore, I find it quite a bad practice to actually `delete` the item in the predicate. This goes against all regular usages of predicates, which normally do not modify the objects they are called on, and sounds like a recipe for disaster. – Matthieu M. Jun 14 '12 at 09:36
  • Yes, I have to agree with Matthieu. How to do this is shown in the questions I linked in my question text; I think a bunch of answers here at Stack Overflow show *how* to use this idiom. My problem is more how to clean up the terms and, lastly, maybe how to choose the right container that could use the above idiom. – Andreas W. Wylach Jun 14 '12 at 10:17