c++ dealing with multiple strings in recv function for irc bot

Question

I am trying to write a simple IRC bot in C++. I have previously done this in Python, but I am struggling with string handling in C++, especially Unicode strings.

So far I can connect to the IRC server and read the buffer, BUT the buffer can contain multiple lines, and it also contains a lot of null data. There is also a possibility of having wide characters or a single message line overflowing the buffer.

I want to read the buffer then process each string line by line for each '\n' terminated line.

#include "stdafx.h"
#include <stdio.h>
#include <string>
#include <iostream>

#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>
#pragma comment(lib,"ws2_32.lib")
#else
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#endif

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const char \
*pass = "pass",
*bot_owner = "name",
*nick = "name",
*serv = "irc.twitch.tv",
*chan = "#name";

using namespace std;


int main()
{

            int ret;
            char buf[512] = "";
#ifdef _WIN32
            SOCKET sock;
            struct WSAData* wd = (struct WSAData*)malloc(sizeof(struct WSAData));
            ret = WSAStartup(MAKEWORD(2, 0), wd);
            free(wd);
            if (ret) { puts("Error loading Windows Socket API"); return 1; }
#else
            int sock;
#endif
            struct addrinfo hints, *ai;
            memset(&hints, 0, sizeof(struct addrinfo));
            hints.ai_family = AF_UNSPEC;
            hints.ai_socktype = SOCK_STREAM;
            hints.ai_protocol = IPPROTO_TCP;
            if (ret = getaddrinfo(serv, "6667", &hints, &ai)) {
                //puts(gai_strerror(ret)); // this doesn't compile
                return 1;
            }
            sock = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (ret = connect(sock, ai->ai_addr, ai->ai_addrlen)) {
                //puts(gai_strerror(ret)); // this doesn't compile
                return 1;
            }
            freeaddrinfo(ai);
            sprintf_s(buf, "PASS %s\r\n", pass);
            send(sock, buf, strlen(buf), 0);
            sprintf_s(buf, "USER %s\r\n", nick);
            send(sock, buf, strlen(buf), 0);
            sprintf_s(buf, "NICK %s\r\n", nick);
            send(sock, buf, strlen(buf), 0);
            int bytesRecieved;
            while ((bytesRecieved = recv(sock, buf, 512, 0)) > 0) {

                std::cout << "\nbytesRecieved : " << bytesRecieved << "\n";
                std::cout << "DATA : " << buf;

                if (!strncmp(buf, "PING ", 5)) {
                    const char * pong = "PONG ";
                    send(sock, pong, strlen(pong), 0);
                }
                if (buf[0] != ':') continue;
                if (!strncmp(strchr(buf, ' ') + 1, "001", 3)) {
                    sprintf_s(buf, "JOIN %s\r\n", chan); 
                    send(sock, buf, strlen(buf), 0);
                }
            }
#ifdef _WIN32
            closesocket(sock);
            WSACleanup();
#else
            close(sock);
#endif

    return 0;
}

What's the best way to split the recv buffer into several strings if it contains many lines separated by '\n', and then iterate over them? How can I deal with a line being split across the end of one buffer and the beginning of the next? And how do I deal with UTF-8 characters, given that Twitch IRC accepts characters from many different languages?

Many thanks, my C++ skills are quite basic and I am mostly trying to convert this bot from a simple one I wrote in python which has lots of nice easy ways of dealing with strings. If you can explain things as if you are dealing with an idiot, I'd appreciate that.

---- edit ----

I think I need to do something like :

        string stringbuilder;
        for (int i = 0; i < bytesRecieved; i++) {

            stringbuilder.push_back(buf[i]);

        }

That is, iterate through the char buffer and build up separate strings by reading until the '\n' char, then start the next one, and put those into a vector(?) of strings. Then iterate over that vector. I don't know how to do this in C++ though. Any ideas? I've tried the boost library below, but this always ends up creating a string at the end with a lot of nonsense chars in it.

Also, this seems to only compile in Visual Studio when I select /MTd (multi-threaded debug). But I really want it to compile as a statically linked project, eventually using MFC, so it needs to compile with /MT (multi-threaded). Does anyone know why I get lots of unresolved external symbols when I do that? – Zac Feb 16 '16 at 16:58
You say the `boost::tokenizer` library "always ends up creating a string at the end with a lot of nonsense chars in [it]." Can you give an SSCCE ( http://sscce.org/ ) on CoLiRu ( http://coliru.stacked-crooked.com/ ) demonstrating your usage? Include an example string you're passing to the tokenizer as well. – caps Feb 23 '16 at 16:23

score 1 · Answer 1 · edited May 23 '17 at 12:22

I would check out boost::tokenizer for splitting the string into multiple substrings, based on a delimiter, to iterate over. You'll need to store the received data in a std::string to pass it to the tokenizer. Example:

#include <boost/tokenizer.hpp>
#include <string>

using sep = boost::char_separator<char>;
using tokenizer = boost::tokenizer<sep>;
constexpr auto separators = "\n";
const auto socket_string = std::string(/*values from socket go here*/);
const auto tokens = tokenizer(socket_string, sep(separators));
/*
 * this loop will iterate over all the lines received from the socket,
 * one line at a time
 */
for (const auto& token : tokens)
{
    /* token represents a single line of input */
}
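
If pulling in Boost is more than you want, the same split can be sketched with only the standard library. This swaps in std::getline over a std::istringstream instead of the tokenizer; `split_lines` is a name made up for this example. A likely cause of the trailing garbage mentioned in the question is building the string from buf without a length: recv does not null-terminate, so the string should be constructed as std::string(buf, bytesRecieved).

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split `data` on '\n' and drop the '\r' that IRC
// places before each '\n'. Empty lines are skipped.
std::vector<std::string> split_lines(const std::string& data) {
    std::vector<std::string> lines;
    std::istringstream stream(data);
    std::string line;
    while (std::getline(stream, line)) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();  // strip the carriage return
        }
        if (!line.empty()) {
            lines.push_back(line);
        }
    }
    return lines;
}
```

Inside the recv loop this would be called as split_lines(std::string(buf, bytesRecieved)). Like the tokenizer, it treats an unterminated fragment at the end of the buffer as a line, so it does not by itself handle messages split across two reads.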

When it comes to strings being split over multiple buffers, you have to have some way to detect that. Where I work, when we send messages over a socket, we prefix each message with an integer giving the number of bytes in the message, so the receiver can check the size of what it has received to know whether it is done. Without a protocol like that, you'll have to decide on some way to parse the data and determine whether you've received everything yet. Or just keep it dumb and simple and parse each buffer as a new string. In your case, if the data you read off the buffer does not end in '\n', then the last line is probably not finished yet. That's what I would check for, but I don't know all your constraints.
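
One way to act on the "does it end in '\n'" check is to keep the unfinished tail around and prepend it to the next read. The following is a minimal sketch of that idea, not a definitive implementation; the class name `LineAssembler` and its `feed` method are invented for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative class: accumulates raw bytes from successive recv() calls
// and hands back only the lines that are complete.
class LineAssembler {
public:
    // Append `len` bytes and return every line completed by them.
    std::vector<std::string> feed(const char* data, std::size_t len) {
        pending_.append(data, len);
        std::vector<std::string> lines;
        std::size_t start = 0;
        std::size_t newline;
        while ((newline = pending_.find('\n', start)) != std::string::npos) {
            std::string line = pending_.substr(start, newline - start);
            if (!line.empty() && line.back() == '\r') {
                line.pop_back();  // IRC lines end in "\r\n"
            }
            lines.push_back(line);
            start = newline + 1;
        }
        pending_.erase(0, start);  // keep only the unfinished tail
        return lines;
    }

private:
    std::string pending_;  // bytes received after the last '\n'
};
```

With one LineAssembler declared before the recv loop, the loop body becomes `for (const auto& line : assembler.feed(buf, bytesRecieved)) { ... }`. Passing the byte count also means stale bytes left in buf from a previous, longer read are never looked at.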

How you deal with UTF-8 characters will depend on your platform. On *nix boxes I believe that std::string is UTF-8 encoded by default. On Windows you might need to use std::wstring.
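
For the purposes of a bot like this, it may help to think of std::string as a plain byte container. UTF-8 never uses the byte value of '\n' inside a multi-byte character, so splitting on '\n' cannot cut a character in half within a complete line. What does change is that size() counts bytes, not characters. The sketch below counts code points by skipping continuation bytes; `utf8_length` is a made-up name, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count UTF-8 code points by skipping continuation
// bytes (bit pattern 10xxxxxx). No validation is attempted.
std::size_t utf8_length(const std::string& text) {
    std::size_t count = 0;
    for (unsigned char byte : text) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte
        }
    }
    return count;
}
```

For example, "olé" encoded as UTF-8 occupies four bytes in a std::string, while utf8_length reports three code points. Anything beyond that (case folding, normalization, grapheme clusters) needs a dedicated Unicode library.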

Also, I'd suggest reading up on idiomatic C++. Your code is about 90% Pure C.

answered Feb 16 '16 at 16:40 by caps · edited May 23 '17 at 12:22 by Community

`std::string` is not encoded in any way, it's just a sequence of `char`s. Same for `std::wstring` and `wchar_t`. – alain Feb 16 '16 at 16:55
It is a sequence of `char`s. On Linux, it is also UTF-8 encoded: http://stackoverflow.com/a/402918/2025214 – caps Feb 16 '16 at 17:00
Yes, but it's the string literal `"olé"` that is encoded as UTF8. You can easily store invalid UTF8 codes in a `std::string`, and any `char` values you want. – alain Feb 16 '16 at 17:08
Of course. My point was that using `std::string` should handle UTF-8 characters as well as he needs to for simple purposes. You need something better than `std::string` for true Unicode handling. – caps Feb 16 '16 at 17:15
Ok, I see. Maybe my comment was a bit pedantic. – alain Feb 16 '16 at 17:18
Pedantic, but still important to remember. – caps Feb 16 '16 at 17:20
std::wstring encodes utf-16 (wchar_t) for Windows only, and is not generally required on Windows due to APIs that are backward-compatible with utf-8 (actually ascii, which is a subset of utf-8). All other modern platforms use utf-8. For utf-8 encoding, std::string is best. Neither std::string nor std::wstring provide any special handling for multiple-character encodings. Python will do some (but not all) of the special handling for you, C++ doesn't even attempt it. – Matt Jordan Feb 16 '16 at 17:20
I downloaded the boost library, unzipped it and added the folder to my additional libraries in Visual Studio. I used #include but my code will not compile. I'm getting the error "no instance of constructor" on the line: const auto tokens = tokenizer(socket_string, separators); – Zac Feb 17 '16 at 14:09
@Zac : Ah, I led you wrong. There's an error in my example. I'll fix it. Check again in a second. – caps Feb 17 '16 at 16:12
You have to construct the `boost::char_separator` object (that we aliased as `sep`) from the `separators`, then pass that in as the second parameter. – caps Feb 17 '16 at 16:13

score 0 · Answer 2 · answered Feb 18 '16 at 14:40

In the end I solved the issue by iterating over the buf char array and pushing each char onto the end of a new string. When I encounter a '\n' char, I add that string to a vector and reset it with the clear() function.

This continues up to the index returned by recv, which marks the end of the valid bytes in the char array.

The vector is then iterated over in a for loop.

        std::vector <string> vs;
        string newString;
        for (int i = 0; i < bytesRecieved; i++) {
            newString.push_back(buf[i]);
            if (buf[i] == '\n') {
                vs.push_back(newString);
                newString.clear();
            }

        }

        for (const auto &item_vs : vs) {
            // This is where the recv buffer lines are iterated over
            cout << "Value : ";
            cout << item_vs;
        }
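
Once each line is its own string, per-line handling gets simpler too. As one example, the code in the question answers a PING with a bare "PONG ", whereas IRC servers expect the reply to echo the PING parameter and end in "\r\n". A rough sketch of building that reply, with `pong_reply` being a name invented here:

```cpp
#include <string>

// Hypothetical helper: build the reply to an IRC PING line, or return an
// empty string if `line` is not a PING. Whether `line` still carries its
// trailing '\r' or '\n' depends on how it was split, so both are stripped.
std::string pong_reply(std::string line) {
    while (!line.empty() && (line.back() == '\n' || line.back() == '\r')) {
        line.pop_back();
    }
    const std::string prefix = "PING ";
    if (line.compare(0, prefix.size(), prefix) != 0) {
        return "";
    }
    return "PONG " + line.substr(prefix.size()) + "\r\n";
}
```

In the loop over vs, a non-empty result would be passed to send() with its size() as the length.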
