I have a vector of objects (the objects are term nodes that, amongst other fields, contain a string field with the term string)
class TermNode {
private:
std::wstring term;
double weight;
...
public:
...
};
After some processing and score calculation, these objects are finally stored in a vector of TermNode pointers, such as
std::vector<TermNode *> termlist;
The resulting list in this vector, containing up to 400 entries, looks like this:
DEBUG: 'knowledge' term weight=13.5921
DEBUG: 'discovery' term weight=12.3437
DEBUG: 'applications' term weight=11.9476
DEBUG: 'process' term weight=11.4553
DEBUG: 'knowledge discovery' term weight=11.4509
DEBUG: 'information' term weight=10.952
DEBUG: 'techniques' term weight=10.4139
DEBUG: 'web' term weight=10.3733
...
What I am trying to do is clean up that final list by removing single terms that are also contained as substrings inside the phrases in the list. For example, looking at the list snippet above, there is the phrase 'knowledge discovery', so I would like to remove the single terms 'knowledge' and 'discovery', because they are also in the list and are redundant in this context. I want to keep the phrases containing the single terms. I am also thinking about removing all strings of 3 characters or fewer. But that is just a thought for now.
For this cleanup process I would like to code a class using remove_if / find_if (with the new C++ lambdas), and it would be nice to have that code in a compact class.
I am not really sure how to solve this. The problem is that I would first have to identify which strings to remove, probably by setting a flag as a delete marker. That means I would have to pre-process the list: find the single terms and the phrases that contain one of those single terms. I think that is not an easy task and might need an advanced algorithm. A suffix tree to identify substrings?
Another loop over the vector, and maybe a copy of the same vector, could do the cleanup. I am looking for the most time-efficient solution.
I have been playing with ideas along the lines of those shown in "std::list erase incompatible iterator using the remove_if / find_if" and the approach used in "Erasing multiple objects from a std::vector?".
So the question basically is: is there a smart way to do this while avoiding multiple loops, and how could I identify the single terms for deletion? Maybe I am really missing something, but probably someone out there can give me a good hint.
Thanks for your thoughts!
Update
I implemented the removal of redundant single terms the way Scrubbins recommended, as follows:
/**
* Functor gets the term of each TermNode object, checks whether the term
* string contains spaces (i.e. the term is a phrase), splits the phrase by
* spaces and finally stores the term tokens into a set. Only terms with a
* weight higher than 'skipAtWeight' are taken into account.
*/
struct findPhrasesAndSplitIntoTokens {
private:
set<wstring> tokens;
double skipAtWeight;
public:
findPhrasesAndSplitIntoTokens(const double skipAtWeight)
: skipAtWeight(skipAtWeight) {
}
/**
* Implements operator()
*/
void operator()(const TermNode * tn) {
// --- skip all terms with weight lower than skipAtWeight
if (tn->getWeight() < skipAtWeight)
return;
// --- get term
wstring term = tn->getTerm();
// --- if the term contains whitespace it is a phrase: tokenize it once
// --- (find_first_of avoids calling narrow isspace() on wchar_t and
// --- avoids re-tokenizing the whole term for every space found)
if (term.find_first_of(L" \t") != wstring::npos) {
// --- simply tokenize the phrase by whitespace and store the
// --- tokens into the tokens set
// --- TODO: check if this really is UTF-8 aware, esp. for
// --- strings containing umlauts, etc !!
wistringstream iss(term);
copy(istream_iterator<wstring,
wchar_t, std::char_traits<wchar_t> >(iss),
istream_iterator<wstring,
wchar_t, std::char_traits<wchar_t> >(),
inserter(tokens, tokens.begin()));
}
}
/**
* return set of extracted tokens
*/
set<wstring> getTokens() const {
return tokens;
}
};
/**
* Functor to find terms in tokens set
*/
class removeTermIfInPhraseTokensSet {
private:
set<wstring> tokens;
public:
removeTermIfInPhraseTokensSet(const set<wstring>& termTokens)
: tokens(termTokens) {
}
/**
* Implements operator()
*/
bool operator()(const TermNode * tn) const {
return tokens.find(tn->getTerm()) != tokens.end();
}
};
...
findPhrasesAndSplitIntoTokens objPhraseTokens(6.5);
objPhraseTokens = std::for_each(
termList.begin(), termList.end(), objPhraseTokens);
set<wstring> tokens = objPhraseTokens.getTokens();
wcout << "size of tokens set=" << tokens.size() << endl;
for_each(tokens.begin(), tokens.end(), printSingleToken());
// --- remove all extracted single tokens from the final terms list
// --- of similar search terms
removeTermIfInPhraseTokensSet removeTermIfFound(tokens);
termList.erase(
remove_if(
termList.begin(), termList.end(), removeTermIfFound),
termList.end()
);
for (vector<TermNode *>::const_iterator tl_iter = termList.begin();
tl_iter != termList.end(); ++tl_iter) {
wcout << "DEBUG: '" << (*tl_iter)->getTerm() << "' term weight=" << (*tl_iter)->getNormalizedWeight() << endl;
if ((*tl_iter)->getNormalizedWeight() <= 6.5) break;
}
...
I couldn't use the C++11 lambda syntax, because my Ubuntu servers currently have g++ 4.4.1 installed. Anyway, it does the job for now. The next step is to check the quality of the resulting weighted terms against other search result sets and see how I can improve the quality and find a way to boost the more relevant terms in conjunction with the original query term. It might not be an easy task; I wish there were some "simple heuristics". But that might be another new question when I have stepped a little further :-)
So thanks to all for this rich contribution of thoughts!