0

I need to write code to store the unique characters and their frequencies in a dynamic array. I need to increase its size as new data comes in. New data in this case will be a new character that is encountered. The algorithm I have in mind is to check the list of known characters every single time I read from the given string. If it is a new character I need to increase the array size by 1. If it is not a new character I will increase its frequency. It is an array of struct letter (in the code below). The problem is that, I spent quite a lot of time with this and had issues with implementing it. So the question is how can I exactly implement it? Thank you spending time to help.

#include <iostream>
#include <string>
#include <bitset>
#define ARR_LEN(arr) sizeof(arr)/sizeof(arr[0])

using namespace std;

struct unique_char {
    char character;
    int frequency;
};

int main() {
    int char_count;
    string str;
    getline(cin, str);

    struct unique_char* chars = new struct unique_char[100];

    system("PAUSE");
    exit(0);
}
user4581301
  • 33,082
  • 7
  • 33
  • 54
  • 6
    Use a `std::map` or a `std::unordered_map`. – EOF May 27 '21 at 21:17
  • 1
    Does this answer your question? https://stackoverflow.com/questions/64049909/c-how-to-count-element-frequency-smartly – 463035818_is_not_an_ai May 27 '21 at 21:22
  • 1
    @463035818_is_not_a_number The question is mistitled, and the referenced dupe is a bad match for what OP is actually asking. –  May 27 '21 at 21:24
  • Side note: [`std::size`](https://en.cppreference.com/w/cpp/iterator/size), added in C++17, is an elegant and safer replacement for the `ARR_LEN` macro. Note that neither will function if the array has been [decayed to a pointer](https://stackoverflow.com/questions/1461432/what-is-array-to-pointer-decay) – user4581301 May 27 '21 at 21:25
  • @user4581301 `std::size` wont work with a pointer to dynamically allocated array (though the macro wont work as well) – 463035818_is_not_an_ai May 27 '21 at 21:42
  • Since characters have a limited encoding range, using an array would be more efficient and straight forward. No need to use a linked list or other schema that `std::map` uses. Simple indexing calculation into an array is more efficient. – Thomas Matthews May 27 '21 at 22:07
  • 1
    @ThomasMatthews On the flip side, clean semantics are more useful long-term than efficiency unless there's a need for it. –  May 27 '21 at 22:08
  • @GianPaolo a map's the simplest, cleanest way to go. The comment about sadism was with respect to using `std::size` on a dynamically allocated array when `std::vector` could have been an option. – user4581301 May 28 '21 at 21:57
  • @user4581301 you are right, I didn't catch the meaning of your comment. Yes, if a student is able to use somehow `std::size`, it means it has already learned what is a `std::vector` – Gian Paolo May 29 '21 at 06:45

2 Answers2

2

As mentionned in the comments, using std::map makes this fairly straightforward.

One of the "fun" things about map is that the indexing operator creates new values "on demand" with a initial value of 0 for ints. So the actual code is essentially one line: chars[c] += 1;

#include <map>
#include <iostream>
#include <string>

using namespace std;

int main() {
  map<char, int> chars;

  string str;
  getline(cin, str);

  for(char c: str) {
    chars[c] += 1;
  }

  for(auto [character, frequency]: chars) {
      cout << character << " : " << frequency << "\n";
  }
}

N.B. There is one major difference between this and @ThomasMatthews's answer:

The map will only contain the characters that have been seen, whereas the array will contain 0s for all characters that were never hit. Which approach you use should be based on which of the two are more useful to you.

  • Good solution. BTW, if you use "map", then in the output, the characters will be automatically sorted and displayed in alphabetic order. But, if you use "unordered_map", then in the output, the characters will NOT be sorted and displayed in alphabetic order. Also, for search operation, the time complexity is different between "map" and "unordered_map". – Job_September_2020 May 27 '21 at 22:08
  • Also, perhaps, you don't need to write "using namespace std" at the top of the file as you already write "std::cout" at the end of the file. People say that "using namespace std" may include too many unnecessary header files. BTW, this code requires C++11 or C++17 to compile due to "for(auto [character, frequency]: chars )". I like your solution. – Job_September_2020 May 27 '21 at 22:12
  • 1
    @Job_September_2020 I wouldn't use `using namespace` ever personally, but I want to keep OPs code as is as much as possible so that they can read the answer more easily. (I do agree that in this context, the `std::` should be dropped from the cout, that was just a reflex). –  May 27 '21 at 22:13
  • 1
    @Job_September_2020 `for (auto [character, frequency]: chars)` requires C++17 due to use of a structured binding. For C++11 and C++14, you can use `for (auto &p: chars) { std::cout << p.first << " : " << p.second << "\n"; }` instead. – Remy Lebeau May 27 '21 at 22:13
  • @RemyLebeau Pedantically, it should be `for(const auto& p: chars) {...}` –  May 27 '21 at 22:18
  • @Frank, Please help me with 2 questions. (1) By default, when you first insert a character into the map, is the frequency for that character automatically set to 0 always ? - ((2) BTW, unrelatedly, if you define an integer as "int a ; " without setting the initial value to 0, sometimes, it may have a default value that is NOT a zero ?) – Job_September_2020 May 27 '21 at 22:41
  • 1
    @Job_September_2020 The map uses [value initialization](https://en.cppreference.com/w/cpp/language/value_initialization) (this page covers both of your questions) in these scenarios, as you can see [here](https://en.cppreference.com/w/cpp/container/map/operator_at). –  May 27 '21 at 22:43
  • 1
    @Job_September_2020 And to clarify wrt/ your second question, from the PoV of the language, the value of an uninitialized variable is not "0", or "sometimes not 0". The value **doesn't exist at all**, and attempting to read it is Undefined Behavior. –  May 27 '21 at 22:50
2

Using an array makes things straight forward:

unsigned int frequencies[256] = {0};
while (std::getline(std::cin, str))
{
   const size_t length = str.length();
   for (unsigned int i = 0; i < length; ++i)
   {
       const char c = str[i];
       ++frequencies[c];
   }
}

Although, you may want to improve efficiency:

const size_t BUFFER_SIZE = 1024u * 1024u;
//...
char buffer[BUFFER_SIZE] = {0};
while (std::cin.read(&buffer[0], BUFFER_SIZE)
{
    const size_t chars_read = cin.gcount();
    for (unsigned int i = 0; i < chars_read; ++i)
    {
         const char c = buffer[i];
         ++frequencies[c];
    }
}

The above code uses block reading to improve input performance. No scanning for newline characters, just read straight into memory. Determine the frequencies from the characters in memory.

Edit 1: unsigned char
From the comments, an unsigned char may be a safer data type than char because char can be signed. This may be an issue when accessing the array slots because a signed char could be negative and negative indices are usually a bad thing. When you run it, if there are issues, replace the char type with unsigned char.

Thomas Matthews
  • 56,849
  • 17
  • 98
  • 154
  • 3
    Maybe use `unsigned char`? – Dmitri May 27 '21 at 22:08
  • @doug: The `const c` should be `const char c`. The compiler will optimize out the temporary assignment and combine with the other statement. – Thomas Matthews May 27 '21 at 22:38
  • 2
    Agree with @Dmitri `char` is dangerous unless it is known that the system implements it as unsigned. A `c` char, if it's signed like in x86, with the high order bit set will access outside of the array. At best it's UB. – doug May 28 '21 at 01:54