How to parse a text file and count repeats in C++

Question

I'm trying to create a C++ program that gets log information from a text file like this:

local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150
local - - [24/Oct/1994:13:41:41 -0600] "GET 1.gif HTTP/1.0" 200 1210
local - - [24/Oct/1994:13:43:13 -0600] "GET index.html HTTP/1.0" 200 3185
local - - [24/Oct/1994:13:43:14 -0600] "GET 2.gif HTTP/1.0" 200 2555
local - - [24/Oct/1994:13:43:15 -0600] "GET 3.gif HTTP/1.0" 200 36403
local - - [24/Oct/1994:13:43:17 -0600] "GET 4.gif HTTP/1.0" 200 441
local - - [24/Oct/1994:13:46:45 -0600] "GET index.html HTTP/1.0" 200 3185

For each line, I want to extract the file name that follows GET, store it somewhere, and count how many times that file name appears in the log file.
After reading the file, I print the 10 most frequently repeated file names.

My problem is that the code below counts every whitespace-separated word in the log file, but that's not what I want: I only want to count the file names between GET and HTTP.

#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <time.h>
#include <math.h>

const long MAX = 1000000;
std::string words[MAX];
long instances[MAX];
long count = 0;

void insert(std::string input) {
    //check first, add if not present
    for (long i = 0; i < count; i++)
        if (input == words[i]) {
            instances[i]++;
            //std::cout << words[i] << std::endl;
            return;
        }

    if (count < MAX) {
        words[count] = input;
        instances[count] = 1;
        count++;
    }
    else
        std::cerr << "Too many unique words in the file";
}

long findTop(std::string &word) {
    //int topIndex = 0;
    long topCount = instances[0];
    long topIndex = 0;

    for (long i = 1; i<count; i++)
        if (instances[i] > topCount) {
            topCount = instances[i];
            topIndex = i;
        }
    instances[topIndex] = 0;
    word = words[topIndex];
    //topIndex = i;
    return topCount;
}

long frequency_of_primes(long n) {
    long i, j;
    long freq = n - 1;
    for (i = 2; i <= n; ++i) for (j = sqrt(i); j>1; --j) if (i%j == 0) { --freq; break; }
    return freq;
}

int main()
{
    std::cout << "Please wait for the result!" << std::endl;
    std::string word;
    std::ifstream data("Text.txt");
    while (data >> word)
        insert(word);
    long topCount = 0;
    for (long i = 0; i<10; i++)
        //cout << words[i] << " " << instances[i] << endl;
        std::cout << " File Name: " << word << "  Visitors #" << findTop(word) << std::endl;
    clock_t t;
    long f;
    t = clock();
    printf("Calculating...\n");
    f = frequency_of_primes(99999);
    printf("The number of primes lower than 100,000 is: %d\n", f);
    t = clock() - t;
    printf("It took me %d clicks (%f seconds).\n", t, ((float)t) / CLOCKS_PER_SEC);
    return 0;
}

First of all stop [using raw arrays](https://stackoverflow.com/questions/46991224/are-there-any-valid-use-cases-to-use-new-and-delete-raw-pointers-or-c-style-arr) please. Also you might probably want a `std::map` to count the recurring matches. — user0042, Dec 17 '17 at 18:06
Break the problem down. To count the number of whatever's and get a tally of each one is usually done using a `std::map` or `std::unordered_map`. Doing research will show you this is a one or two line operation in a single loop. — PaulMcKenzie, Dec 17 '17 at 18:22
If your question is how to get the part between *GET* and *HTTP*, then just count the number of characters from the start to the *GET* string (44), then read until next whitespace. So in the loop where you're feeding from `data` into `word`, call a substring function between the two ends I pointed here. — Al.G., Dec 17 '17 at 22:17

score 0 · Answer 1 · answered Dec 17 '17 at 23:21

The function get_file_name() finds the first and the last quotation mark in the log line, takes the request between them, and then narrows that down to the file name. This is essentially what @Al.G. suggested. However, you may also want to look at the regex support that C++ offers.
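As a rough sketch of the regex route, assuming every request in the log uses GET and the file name contains no spaces (both hold for the sample lines in the question), something like the following could work. The name extract_file_name is hypothetical and is not part of the code further down.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: on success, stores the token between "GET" and "HTTP/"
// in `file_name` and returns true; returns false if the line has no such request.
bool extract_file_name(const std::string& line, std::string& file_name) {
  // Matches: opening quote, GET, whitespace, a run of non-space characters
  // (captured), whitespace, HTTP/
  static const std::regex request_re(R"re("GET\s+(\S+)\s+HTTP/)re");
  std::smatch match;
  if (!std::regex_search(line, match, request_re)) {
    return false;
  }
  file_name = match[1].str();
  return true;
}
```

Calling it on each line read with std::getline and skipping the lines for which it returns false keeps malformed lines out of the counts.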

I also did not add any error handling for the input or output files; the code is only meant as an example of using unordered_map, as @PaulMcKenzie suggested.

#include <iostream>
#include <fstream>
#include <string>
#include <unordered_map>

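std::string get_file_name(const std::string& s) {
  // Locate the quoted request, e.g. "GET index.html HTTP/1.0".
  std::size_t first = s.find_first_of("\"");
  std::size_t last = s.find_last_of("\"");
  if (first == std::string::npos || last <= first) {
    return "";  // no quoted request on this line
  }

  // The length is last - first; first - last would wrap around for std::size_t.
  std::string request = s.substr(first, last - first);

  // Skip past the method (GET) to the start of the file name.
  std::size_t file_begin = request.find_first_of(' ');
  std::string truncated_request = request.substr(++file_begin);

  // The file name ends at the next space.
  std::size_t file_end = truncated_request.find(' ');
  std::string file_name = truncated_request.substr(0, file_end);

  return file_name;
}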


int main() {

  std::ifstream f_s("header_log.txt");
  std::string content;
  std::unordered_map<std::string, int> file_access_counts;

  while (std::getline(f_s, content)) {
    // operator[] value-initializes a missing count to 0 before the increment.
    ++file_access_counts[get_file_name(content)];
  }

  f_s.close();

  std::ofstream ofs;
  ofs.open ("output.txt", std::ofstream::out | std::ofstream::app);

  for (auto& n: file_access_counts)
    ofs << n.first << ", " << n.second << std::endl;

  ofs.close();

  return 0;
}
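
The code above writes every file name in whatever order the unordered_map happens to iterate, while the question asks for the 10 most repeated names. One possible way to get there, sketched under the assumption that the counts are in a std::unordered_map<std::string, int> like file_access_counts, is to copy the entries into a vector and partially sort it. The helper name top_n and its tie-breaking rule are illustrative choices, not something from the question.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: returns up to `n` (file name, count) pairs,
// highest count first.
std::vector<std::pair<std::string, int>> top_n(
    const std::unordered_map<std::string, int>& counts, std::size_t n) {
  // Copy the map entries so they can be reordered.
  std::vector<std::pair<std::string, int>> entries(counts.begin(), counts.end());
  n = std::min(n, entries.size());

  // Only the first n positions need to end up in sorted order.
  std::partial_sort(
      entries.begin(), entries.begin() + static_cast<std::ptrdiff_t>(n),
      entries.end(),
      [](const std::pair<std::string, int>& a,
         const std::pair<std::string, int>& b) {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;  // assumed tie-break: alphabetical order
      });

  entries.resize(n);
  return entries;
}
```

In main(), looping over top_n(file_access_counts, 10) instead of over the map itself would print at most ten lines, most requested first.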
