I would like to know how to remove duplicate words from a container of strings, treating words that differ only by trailing punctuation as equal.

For example, given this string:

Why do do we we here here?

I would like to get this output:

Why do we here?

Jonathan Mee
Shubham
  • [Tokenize string.](https://stackoverflow.com/questions/53849/how-do-i-tokenize-a-string-in-c) – Mahesh Aug 11 '17 at 14:05
  • Possible duplicate of [Most elegant way to split a string?](https://stackoverflow.com/questions/236129/most-elegant-way-to-split-a-string) – Leonardo Alves Machado Aug 11 '17 at 14:06
  • @Leonardo Can you tell me how? – Shubham Aug 11 '17 at 14:11
  • Do you know stream input (`cin >> x;`)? Do you know how to enlarge an array? What do you mean *compare*, do you mean test for equality? – Beta Aug 11 '17 at 14:13
  • @Beta yes the test for equality. – Shubham Aug 11 '17 at 14:14
  • `if(str[3] == "here") {...}` – Beta Aug 11 '17 at 14:15
  • @Beta The question is how to remove duplicates from the string. Example: "why you here here?" Answer: "why you here?" So I want to remove "here" from the sentence, but when I compare the two words they are different because of the "?". – Shubham Aug 11 '17 at 14:17
  • Documentation has a great article on tokenization: https://stackoverflow.com/documentation/c%2b%2b/488/stdstring/2148/tokenize By a great author ;) Perhaps looking over that would be helpful. You may be able to solve your problem on your own after reading that. If not you really need to edit the question to clarify. Are you: 1) Asking how to tokenize a string? 2) Asking how to compare strings? 3) Asking how to chop punctuation from words? 4) Asking how to remove duplicate strings from a container? Note that you should have said yes to only 1 of these or your question is too broad. – Jonathan Mee Aug 11 '17 at 14:23
  • @Shubham have you clicked on the link? There you can find several ways to split a string in c++. Use the one you like the most – Leonardo Alves Machado Aug 11 '17 at 14:26
  • @JonathanMee 4: how to remove duplicates from the sentence, and the twist is in the last word. – Shubham Aug 11 '17 at 14:27
  • @Shubham So you're really asking 3 *and* 4. Still probably too much for a question but, at least edit it so it's clear that you're not asking how to tokenize a string. – Jonathan Mee Aug 11 '17 at 14:32
  • Do you also want to normalize capitalization? – Daniel H Aug 11 '17 at 15:01
  • Use `std::set` to contain your words. The `std::set` doesn't allow duplicates. – Thomas Matthews Aug 11 '17 at 15:21
  • @ThomasMatthews A `set` doesn't preserve order. – Jonathan Mee Aug 11 '17 at 16:29

2 Answers

The algorithm:

  1. While reading a word succeeds (reading fails at end of file, which ends the loop), do:
  2. If the word list is empty, push back the word.
  3. Otherwise, search the word list for the word.
  4. If the word isn't in the list, push back the word.
  5. End of the while loop.
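
The steps above can be sketched as one function. This is only a minimal sketch: the name `unique_words` is hypothetical, and a `std::istringstream` stands in for the data file so the example is self-contained. It compares words exactly, so it does not yet ignore trailing punctuation.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the words of `text` in their original order,
// keeping only the first occurrence of each (exact, case-sensitive match).
std::vector<std::string> unique_words(const std::string& text)
{
    std::istringstream input(text);  // stands in for the data file
    std::vector<std::string> word_list;
    std::string word;
    while (input >> word)  // the loop ends when reading a word fails
    {
        // Search the list; push back only if the word is not there yet.
        // On an empty list the search finds nothing, so the word is added.
        if (std::find(word_list.begin(), word_list.end(), word) == word_list.end())
        {
            word_list.push_back(word);
        }
    }
    return word_list;
}
```

For the question's sentence this yields the words Why, do, we, here, here? in that order: the exact comparison still treats "here" and "here?" as different words.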

Use `std::string` for your word. This allows you to do the following:

std::string word;
while (data_file >> word)
{
}

Use `std::vector` to contain your words (although you could use `std::list` as well). The `std::vector` grows dynamically, so you don't have to worry about reallocation if you picked the wrong size.
To append to a `std::vector`, use the `push_back` method.

To compare `std::string`s, use `operator==`:

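std::string new_word;
std::vector<std::string> word_list;
//...
bool found = false;
for (std::size_t index = 0; index < word_list.size(); ++index)
{
  if (word_list[index] == new_word)  // operator== compares the contents
  {
    found = true;  // already in the list, so stop searching
    break;
  }
}
if (!found) word_list.push_back(new_word);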
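
A plain `operator==` treats "here" and "here?" as different words, which is the twist in the question. One possible way around that, sketched here with the hypothetical helpers `strip_trailing_punct` and `same_word`, is to strip trailing punctuation from both words before comparing them. This assumes that "punctuation" means whatever `std::ispunct` reports in the default locale.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns a copy of `word` without trailing punctuation.
std::string strip_trailing_punct(std::string word)
{
    // The cast avoids undefined behavior for negative char values.
    while (!word.empty() && std::ispunct(static_cast<unsigned char>(word.back())))
    {
        word.pop_back();
    }
    return word;
}

// Hypothetical helper: true if the words are equal once trailing
// punctuation is ignored. The comparison is still case-sensitive.
bool same_word(const std::string& a, const std::string& b)
{
    return strip_trailing_punct(a) == strip_trailing_punct(b);
}
```

Calling `same_word(word_list[index], new_word)` in place of `word_list[index] == new_word` would then make "here" and "here?" compare equal.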
Thomas Matthews

So you have said you know how to tokenize a string. (If you don't, spend some time here: https://stackoverflow.com/a/38595708/2642059) So I'm going to assume that we're given a `vector<string> foo` which contains words with possibly trailing punctuation.

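for (auto it = cbegin(foo); it != cend(foo); ++it) {
    // Print *it only if no later word matches it once trailing punctuation is ignored.
    if (none_of(next(it), cend(foo), [&](const auto& i) {
            // Find the first position at which the two words differ.
            const auto finish = mismatch(cbegin(*it), cend(*it), cbegin(i), cend(i));
            // They match if what is left of each word is empty or starts with a non-alphanumeric character.
            return (finish.first == cend(*it) || !isalnum(static_cast<unsigned char>(*finish.first))) &&
                   (finish.second == cend(i) || !isalnum(static_cast<unsigned char>(*finish.second)));
        })) {
        cout << *it << ' ';
    }
}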

Live Example

It's worth noting here that you haven't given us rules on how to handle words like "down", "down-vote", and "downvote". This algorithm presumes that the first two are equal. You also haven't given us rules for how to handle: "Why do, do we we here, here?" This algorithm always keeps the final repetition, so the output would be "Why do we here?"
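
To try these cases out, the loop can be wrapped in a function that tokenizes a sentence and returns the result as a string. This is only a sketch: the name `remove_repeats` is hypothetical, whitespace tokenization through `std::istream_iterator` is an assumption, and the names are `std::`-qualified here instead of relying on `using namespace std`. It needs C++14 or later for the generic lambda and the four-iterator `std::mismatch`.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper: splits `sentence` on whitespace and keeps only the
// last occurrence of each word, ignoring differences in trailing punctuation.
std::string remove_repeats(const std::string& sentence)
{
    std::istringstream input(sentence);
    const std::vector<std::string> foo(std::istream_iterator<std::string>{input},
                                       std::istream_iterator<std::string>{});
    std::string result;
    for (auto it = std::cbegin(foo); it != std::cend(foo); ++it) {
        // Keep *it only if no later word matches it.
        if (std::none_of(std::next(it), std::cend(foo), [&](const auto& i) {
                // First position at which the two words differ.
                const auto finish = std::mismatch(std::cbegin(*it), std::cend(*it),
                                                  std::cbegin(i), std::cend(i));
                // Match: the rest of each word is empty or starts with a
                // non-alphanumeric character.
                return (finish.first == std::cend(*it) ||
                        !std::isalnum(static_cast<unsigned char>(*finish.first))) &&
                       (finish.second == std::cend(i) ||
                        !std::isalnum(static_cast<unsigned char>(*finish.second)));
            })) {
            if (!result.empty()) result += ' ';
            result += *it;
        }
    }
    return result;
}
```

With this sketch, both "Why do do we we here here?" and "Why do, do we we here, here?" come back as "Why do we here?", while "down down-vote downvote" comes back as "down-vote downvote".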

If the presumptions made by this algorithm are not to your liking, leave me a comment and we'll work on getting you comfortable enough with this code to make the adjustments that you need.

Jonathan Mee
  • I am just a beginner, so I will try to understand the code. Thanks for the reply. – Shubham Aug 12 '17 at 09:06
  • @Shubham I'd encourage you to spend some time with this, as I believe it's the best solution for your question. I've provided the Live Example which you can fork and try out different things with. Let me know if there is anything specific that I can explain to you. – Jonathan Mee Aug 12 '17 at 22:39