I'm writing a program for a personal project that takes a list of words and their occurrence counts from Google Books and loads them into a vector, with each count attached to its word, so I can whittle the list down. Each line of the file holds a word, a tab character (\t), the count, and a newline (\n). I don't have much experience with this kind of programming, so I was wondering how one might parse a file formatted this way. Here's what I have so far:

#include <iostream>
#include <string>
#include <fstream>
#include <vector>

#define FILE_NAME

using namespace std;

// structure denoting a word occurrence
// contains the string of the word and an integer representing its frequency
struct word_occ {
    string word;
    int occurence;
};

vector<word_occ> words_vector;


int main() {
    /*
    File is a .txt file that has the following format:
    word1  #####
    word2  #####

    where word is the word from the english 1-grams from google books
    and ##### is the number of occurrences.
    The word is separated from its occurrences by a tab (\t) and from other words by a newline (\n).
    All words are entirely lowercase, and all numbers are integers lower than 20,000,000
    */
    ifstream all_words_list(FILE_NAME);
    
    string line;

    string line_word;
    int line_occurence;

    word_occ this_line;

    while (getline(all_words_list, line)) {

        // ... <-- what goes here?

        this_line.word = line_word;
        this_line.occurence = line_occurence;
        words_vector.push_back(this_line);
    }
}
Thermobyte

1 Answer

A string stream (std::istringstream, declared in <sstream>) would likely work:

while (getline(all_words_list, line)) {
    std::istringstream ss(line);
    ss >> line_word;
    ss >> line_occurence; 

    ...
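
For reference, the loop above could be wrapped up roughly as follows. This is only a sketch: `parse_line` is a hypothetical helper name, not something from the question, and the struct is repeated so the snippet compiles on its own. It assumes each line holds exactly one whitespace-free word followed by an integer.

```cpp
#include <sstream>
#include <string>

// Same shape as the struct in the question (its spelling of `occurence` is kept).
struct word_occ {
    std::string word;
    int occurence;
};

// Hypothetical helper: fills `out` from one "word<TAB>count" line.
// Returns false if the line does not hold a word followed by an integer.
bool parse_line(const std::string& line, word_occ& out) {
    std::istringstream ss(line);
    // operator>> skips leading whitespace and stops at the next whitespace
    // character, so the tab between the two fields acts as a separator here.
    return static_cast<bool>(ss >> out.word >> out.occurence);
}
```

Inside the `while (getline(all_words_list, line))` loop, one would call `parse_line(line, this_line)` and only `push_back` when it returns true, so malformed lines are skipped rather than stored.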
selbie
  • That'll break on *whitespace* in the line, but the OP wants to break on *tabs*. `istringstream ss(line); string field; while (getline(ss, field, '\t')) {...}` – Eljay Mar 08 '22 at 17:54
  • Good point. When I saw `word`, I assumed it was exactly a single word with no whitespace. – selbie Mar 08 '22 at 19:04
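
Following up on the tab-splitting suggestion in the comments, a minimal sketch of that variant might look like the code below. `read_words` is a hypothetical helper name, and the choice to silently skip malformed lines is an assumption, not something the question specifies. It takes any `std::istream`, so the `ifstream` from the question could be passed in directly.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Same shape as the struct in the question (its spelling of `occurence` is kept).
struct word_occ {
    std::string word;
    int occurence;
};

// Hypothetical helper: reads "word<TAB>count" lines from any input stream.
// Lines without a tab or without a valid integer are skipped (an assumption).
std::vector<word_occ> read_words(std::istream& in) {
    std::vector<word_occ> result;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        word_occ entry;
        // Everything up to the first tab is the word, spaces included.
        if (!std::getline(ss, entry.word, '\t')) continue;
        // The rest of the line should start with the integer count.
        if (!(ss >> entry.occurence)) continue;
        result.push_back(entry);
    }
    return result;
}
```

Unlike plain `ss >> line_word`, splitting on `'\t'` keeps an entry such as `new york` intact as a single word. For the all-lowercase 1-gram file described in the question, both approaches should give the same result.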