5

At the request of the fantastic fellas over at the C++ chat lounge: what is a good way to break down a file (which in my case contains a string with roughly 100 lines, and about 10 words in each line) and insert all these words into a std::set?

Nico Bellic
  • I'm not sure what you mean by the verb "*to index*". Perhaps you meant "and **insert** all these words into a std::set?" – Robᵩ Jun 21 '12 at 20:45

3 Answers

25

The easiest way to construct any container from a source that holds a series of elements is to use the constructor that takes a pair of iterators. Use istream_iterator to iterate over a stream.

#include <set>
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>

using namespace std;

int main()
{
  //I create an iterator that retrieves `string` objects from `cin`
  auto begin = istream_iterator<string>(cin);
  //I create an iterator that represents the end of a stream
  auto end = istream_iterator<string>();
  //and iterate over the file, and copy those elements into my `set`
  set<string> myset(begin, end);

  //this line copies the elements in the set to `cout`
  //I have this to verify that I did it all right
  copy(myset.begin(), myset.end(), ostream_iterator<string>(cout, "\n"));
  return 0;
}

http://ideone.com/iz1q0

Mooing Duck
  • I'd _just_ finished writing after seeing Drise's sample in chat – Mooing Duck Jun 21 '12 at 20:45
  • 1
    I would like to clarify that I chose Drise's answer over this because this kind of question falls more into the "beginner" category, and MooingDuck's answer seems to be a bit more advanced (future readers, including myself at the moment, might not be able to understand it). @MooingDuck I absolutely appreciate it though. – Nico Bellic Jun 21 '12 at 21:01
  • 5
    @Nico : On the other hand, _this_ code is idiomatic; if I saw Drise's code in real, production code, I'd scratch my head and wonder why they took such a verbose approach and probably end up rewriting it this way. – ildjarn Jun 21 '12 at 21:08
  • I don't understand this code 100% yet. But I think the magical word splitting happens in `istream_iterator<string>(cin)`. Is there a similar way to split all words in a given `std::string`? – Lukas Schmelzeisen Jun 21 '12 at 21:09
  • 2
    @LukasSchmelzeisen: `istringstream` is designed for exactly that purpose. It makes a `string` that acts like an `istream` – Mooing Duck Jun 21 '12 at 21:10
  • 3
    @Drise : Sorry, but this code is simply not confusing or advanced. – ildjarn Jun 21 '12 at 21:12
  • 3
    @Drise: Once you get more familiar with using iterators, and the `` header, this code is _way_ easier to understand than your code. – Mooing Duck Jun 21 '12 at 21:12
  • 4
    @Drise: How can this be any easier to read? You are iterating over a stream and putting it in a container (reading a string from a stream retrieves one word, so we are reading words and putting them into a set). This is much easier to read and understand than your code. – Martin York Jun 21 '12 at 23:21
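For the question in the comments about splitting a `std::string` rather than `cin`, here is a minimal sketch of the same iterator-pair technique applied to a `std::istringstream`. The helper name `words_from_string` is made up for illustration and is not part of the answer.

```cpp
#include <iterator>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper (not from the answer): split `text` on whitespace
// and collect the distinct words.
std::set<std::string> words_from_string(const std::string& text)
{
    // An istringstream lets the string be read like any other input stream.
    std::istringstream in(text);

    // Same iterator-pair constructor as in the answer, with `in` in place of cin.
    return std::set<std::string>(std::istream_iterator<std::string>(in),
                                 std::istream_iterator<std::string>());
}
```

Because `operator>>` on a `std::string` skips any run of whitespace, consecutive spaces, tabs, and newlines never produce empty words here.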
3

Assuming you've read your file into a string, boost::split will do the trick:

#include <iostream>
#include <set>
#include <string>
#include <boost/foreach.hpp>
#include <boost/algorithm/string.hpp>

int main()
{
    std::string astring = "abc 123 abc 123\ndef 456 def 456";  // your string
    std::set<std::string> tokens;                              // this will receive the words
    boost::split(tokens, astring, boost::is_any_of("\n "));    // split on space & newline

    // Print the individual words (the set holds each distinct word once)
    BOOST_FOREACH(const std::string& token, tokens) {
        std::cout << token << '\n';
    }
    return 0;
}

Lists or Vectors can be used instead of a Set if necessary.
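If Boost is not available, the same idea can be sketched with only the standard library. This is a swapped-in technique, not what `boost::split` does internally: it uses `find_first_of` and `find_first_not_of`, and the helper name `split_any_of` is made up for illustration. One deliberate difference: `boost::split` keeps empty tokens between adjacent delimiters by default, while this sketch drops them.

```cpp
#include <set>
#include <string>

// Hypothetical helper (not a Boost API): split `text` at any character
// found in `delims`, dropping empty tokens.
std::set<std::string> split_any_of(const std::string& text,
                                   const std::string& delims)
{
    std::set<std::string> tokens;

    // Skip leading delimiters to find the start of the first token.
    std::string::size_type start = text.find_first_not_of(delims);
    while (start != std::string::npos) {
        // The token runs up to the next delimiter, or to the end of the string.
        std::string::size_type end = text.find_first_of(delims, start);
        if (end == std::string::npos) {
            tokens.insert(text.substr(start));
            break;
        }
        tokens.insert(text.substr(start, end - start));

        // Skip the run of delimiters that follows this token.
        start = text.find_first_not_of(delims, end);
    }
    return tokens;
}
```

Calling `split_any_of(astring, "\n ")` on the string from the answer gives a set of the four distinct words.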

Also note this is almost a dupe of: Split a string in C++?

Josh
2
#include <set>
#include <iostream>
#include <string>

int main()
{
  std::string temp, mystring;
  std::set<std::string> myset;

  while(std::getline(std::cin, temp))
      mystring += temp + ' ';
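  temp.clear();

  for (size_t i = 0; i < mystring.length(); i++)
  {
    char c = mystring.at(i);
    if (c == ' ' || c == '\n' || c == '\t')
    {
      // a separator ends the current word; consecutive separators
      // leave temp empty, and empty words are not inserted
      if (!temp.empty())
        myset.insert(temp);
      temp.clear();
    }
    else
    {
      temp.push_back(c);
    }
  }
  // insert the last word if the input did not end with a separator
  if (!temp.empty())
    myset.insert(temp);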

  for (std::set<std::string>::iterator i = myset.begin(); i != myset.end(); i++)
  {
    std::cout << *i << std::endl;
  }
  return 0;
}

Let's start at the top. First off, you need a few variables to work with. temp is just a placeholder for the string while you build it from each character in the string you want to parse. mystring is the string you are looking to split up and myset is where you will be sticking the split strings.

So then we read the file (input through < piping) and insert the contents into mystring.

Now we walk the string one character at a time, looking for spaces, newlines, or tabs to split on. When we find one of those characters, we insert the placeholder string into the set and empty it; otherwise, we append the character to the placeholder, which builds up the current word. Once the loop finishes, we add the last word to the set.

Finally, we iterate down the set, and print each string, which is simply for verification, but could be useful otherwise.

Edit: Loki Astari suggested a significant improvement to my code in a comment, which I thought should be integrated into the answer:

#include <set>
#include <iostream>
#include <string>
#include <utility>  // std::move

int main()
{
  std::set<std::string> myset;
  std::string word;

  while(std::cin >> word)
  {
      myset.insert(std::move(word));
  }

  for(std::set<std::string>::const_iterator it=myset.begin(); it!=myset.end(); ++it)
    std::cout << *it << '\n';
}
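Both programs above read from `std::cin`, so the file has to be supplied by redirection. As a rough sketch, the same loop can take any `std::istream`, which also lets it read a file opened by name. The helper names `read_words` and `read_words_from_file` are made up for illustration, and returning an empty set for a file that cannot be opened is an assumption, not something from the answer.

```cpp
#include <fstream>
#include <istream>
#include <set>
#include <string>

// Hypothetical helper: collect the distinct whitespace-separated words
// of any input stream (std::cin, a file, a string stream, ...).
std::set<std::string> read_words(std::istream& in)
{
    std::set<std::string> words;
    std::string word;
    while (in >> word)        // operator>> skips whitespace and reads one word
        words.insert(word);
    return words;
}

// Hypothetical helper: the same, for a file given by path.
// Assumption: a file that cannot be opened simply yields an empty set.
std::set<std::string> read_words_from_file(const std::string& path)
{
    std::ifstream file(path.c_str());
    return read_words(file);  // reading from a failed stream stops immediately
}
```

With this split, `read_words(std::cin)` behaves like the loop above, and `read_words_from_file` covers the case where piping is not convenient.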
Drise
  • While Drise's code is more verbose, it seems to be running a lot faster than Mooning Duck's code – Lukas Schmelzeisen Jun 21 '12 at 22:04
  • Less overhead than letting `` do it for me? – Drise Jun 21 '12 at 22:10
  • Too much manual work. This should be about 3 lines long. You are writing C and using a couple of C++ containers. This is what we refer to as C with classes. – Martin York Jun 21 '12 at 23:16
  • @LukasSchmelzeisen: If that is true you are doing something else wrong. – Martin York Jun 22 '12 at 00:08
  • @LokiAstari: There's only one 'n' in my name :P And I agree, from what I see, my code ought to be far faster than Drise's. Here's a version based on Drise's that ought to be faster than both of our answers that I coded for no real reason: http://ideone.com/VzQi5 – Mooing Duck Jun 22 '12 at 00:10
  • I prefer just a simple loop. http://ideone.com/oN73c. The time you save building the word manually will not be significant enough to time the difference. Though the std::move may help in C++0x – Martin York Jun 22 '12 at 00:36
  • @LokiAstari Unfortunately, this is the way my university teaches things, so I'm still in that mindset. However, I'm working diligently to break the habit. – Drise Jun 22 '12 at 16:44