A better way to split a string into an array of strings in C/C++ using whitespace as a delimiter

Question

Sorry, my C/C++ is not that good, but the following existing code looks like garbage even to me. It also has a bug - fails when str = "07/02/2010" terminated by '\0' - . I think that instead of fixing a bug, it might as well be rewritten. In Python it is just 'kas\nhjkfh kjsdjkasf'.split(). I know this is C-ish code, but it cannot be that complicated to split a string! Sticking to the same signature, and without using extra libraries, how can I improve it - make it short and sweet? I can tell that this code smells, for instance because of the else clause all the way at the end.

LINE THAT FAILS:

_tcsncpy_s(
    s.GetBuffer((int) (nIndex-nLast)),
    nIndex-nLast,
    psz+nLast,
    (size_t) (nIndex-nLast)
);

With the string "07/02/2010" terminated by '\0' it will try to write 11 characters into a buffer that is only 10 characters long.

FULL FUNCTION:

#define 

// This will return the text string as a string array
// This function is called from SetControlText to parse the
// text string into an array of CStrings that the control
// Gadgets will attempt to interpret

BOOL CLVGridDateTimeCtrl::ParseTextWithCurrentFormat(const CString& str, const CGXStyle* pOldStyle, CStringArray& strArray )
{
    // Unused:
    pOldStyle;

    // we assume that the significant segments are seperated by space

    // Please change m_strDelim to add other delimiters

    CString s;

    LPCTSTR psz = (LPCTSTR) str;

    BOOL bLastCharSpace = FALSE;
    DWORD size = str.GetLength()+1;

    // (newline will start a new row, tab delimiter will
    // move to the next column).
    // parse buffer (DBCS aware)
    for (DWORD nIndex = 0, nLast = 0; nIndex < size; nIndex += _tclen(psz+nIndex))
    {
        // check for a delimiter
        if (psz[nIndex] == _T('\0') || _tcschr(_T("\r\n"), psz[nIndex]) || _tcschr(_T(" "), psz[nIndex])
            ||!_tcscspn(&psz[nIndex], (LPCTSTR)m_strDelim))
        {
            s.ReleaseBuffer();
            s.Empty();
            // abort parsing the string if next char
            // is an end-of-string
            if (psz[nIndex] == _T('\0'))
            {
                if (psz[nIndex] == _T('\r') && psz[nIndex+1] == _T('\n'))
                    nIndex++;

                _tcsncpy_s(s.GetBuffer((int) (nIndex-nLast)),
                    nIndex-nLast,
                            psz+nLast,
                            (size_t) (nIndex-nLast));
                CString temStr = s;
                strArray.Add(temStr);
                temStr.Empty();
                break;
            }

            else if (_tcscspn(&psz[nIndex], (LPCTSTR)m_strDelim) == 0 && !bLastCharSpace)
            {
                if (psz[nIndex] == _T('\r') && psz[nIndex+1] == _T('\n'))
                    nIndex++;

                _tcsncpy_s(s.GetBuffer((int) (nIndex-nLast)),
                    nIndex-nLast,
                            psz+nLast,
                            (size_t) (nIndex-nLast));
                CString temStr = s;
                strArray.Add(temStr);
                temStr.Empty();
                bLastCharSpace = TRUE;
                // abort parsing the string if next char
                // is an end-of-string
                if (psz[nIndex+1] == _T('\0'))
                    break;

            }
            // Now, that the value has been copied to the cell,
            // let's check if we should jump to a new row.
            else if (_tcschr(_T(" "), psz[nIndex]) && !bLastCharSpace)
            {
                if (psz[nIndex] == _T('\r') && psz[nIndex+1] == _T('\n'))
                    nIndex++;

                _tcsncpy_s(s.GetBuffer((int) (nIndex-nLast)),
                    nIndex-nLast,
                            psz+nLast,
                            (size_t) (nIndex-nLast));
                CString temStr = s;
                strArray.Add(temStr);
                temStr.Empty();
                bLastCharSpace = TRUE;
                // abort parsing the string if next char
                // is an end-of-string
                if (psz[nIndex+1] == _T('\0'))
                    break;
            }

            nLast = nIndex + _tclen(psz+nIndex);


        }
        else
        {   
            // nLast = nIndex + _tclen(psz+nIndex);
            bLastCharSpace = FALSE;
        }
    }
    if (strArray.GetSize())
        return TRUE;
    else
        return FALSE;
}

EDIT: m_strDelim = _T(","); and this member variable is used in this function only. I suppose I see the point of tokenization now - it tries to parse a date and time ... wait, there is more! Here is the code which calls this function below. Please help me improve this as well. Some of my co-workers claim that C# makes them no more productive than C++. I used to feel like an idiot for not being able to say the same about me.

// SetControlText will attempt to convert the text to a valid date first with
// the help of COleDateTime and then with the help of the Date control and the
// current format

BOOL CLVGridDateTimeCtrl::ConvertControlTextToValue(CString& str, ROWCOL nRow, ROWCOL nCol, const CGXStyle* pOldStyle)
{
    CGXStyle* pStyle = NULL;
    BOOL bSuccess = FALSE;

    if (pOldStyle == NULL)
    {
        pStyle = Grid()->CreateStyle();
        Grid()->ComposeStyleRowCol(nRow, nCol, pStyle);
        pOldStyle = pStyle;
    }

    // allow only valid input
    {
        // First do this
        CLVDateTime dt;

        if (str.IsEmpty())
        {
            ;
            // if (Grid()->IsCurrentCell(nRow, nCol))
            //  Reset();
            bSuccess = TRUE;
        }
        else if (dt.ParseDateTime(str,CLVGlobals::IsUSDateFormat()) && (DATE) dt != 0)
        {
            SetDateTime(dt);
            if (m_bDateValueAsNumber)
                str.Format(_T("%g"), (DATE) dt);
            else
                str = dt.Format();
            bSuccess = TRUE;
        }
        else
        {
            // parse the string using the current format
            CStringArray strArray;
            if (!ParseTextWithCurrentFormat(str, pOldStyle, strArray))
                return FALSE;

            UpdateNullStatus(m_TextCtrlWnd);

            SetFormat(m_TextCtrlWnd, *pOldStyle);

            int nArrIndex = 0;
            for(int i=0; i<m_TextCtrlWnd.m_gadgets.GetSize(); i++)
            {
                int val = m_TextCtrlWnd.m_gadgets[i]->GetValue();   
                // s.Empty();
                if(m_TextCtrlWnd.m_gadgets[i]->IsKindOf(RUNTIME_CLASS(SECDTNumericGadget)))
                {
                    // TRACE(_T("The value %s\n"), strArray[nArrIndex]);
                    ((CLVDTNumericGadget*)m_TextCtrlWnd.m_gadgets[i])->m_nNewValue = _ttoi(strArray[nArrIndex]);    
                    nArrIndex++;
                    if (nArrIndex>strArray.GetUpperBound())
                            break;
                }
                else if(m_TextCtrlWnd.m_gadgets[i]->IsKindOf(RUNTIME_CLASS(SECDTListGadget)) && val!=-1)
                {
                    int nIndex = ((CLVDTListGadget*)m_TextCtrlWnd.m_gadgets[i])->FindMatch(strArray[nArrIndex], ((CLVDTListGadget*)m_TextCtrlWnd.m_gadgets[i])->GetValue()+1);
                    if (nIndex!=-1)
                    {
                        // TRACE(_T("The value %s\n"), strArray[nArrIndex]);
                        ((CLVDTListGadget*)m_TextCtrlWnd.m_gadgets[i])->SetValue(nIndex);
                        nArrIndex++;
                        if (nArrIndex>strArray.GetUpperBound())
                            break;
                    }

                }

                CLVDBValue dbDate = m_TextCtrlWnd.GetDateTime();
                if (dbDate.IsNull())
                    str = _T("");
                else
                {
                    CLVDateTime dt = (CLVDateTime)dbDate;
                    if (m_bDateValueAsNumber)
                        str.Format(_T("%g"), (DATE) dt);
                    else
                        str = dt.Format();
                }
            }
            bSuccess = TRUE;
        }
    }

    if (pStyle)
        Grid()->RecycleStyle(pStyle);

    return bSuccess;
}

score 14 · Answer 1 · answered Jul 02 '10 at 02:28

14

The String Toolkit Library (Strtk) has the following solution to your problem:

#include <string>
#include <deque>
#include "strtk.hpp"
int main()
{ 
   std::string data("kas\nhjkfh kjsdjkasf");
   std::deque<std::string> str_list;
   strtk::parse(data, ", \r\n", str_list);
   return 0;
}

More examples can be found Here

answered Jul 02 '10 at 02:28

Hm ... I wonder if I can steal just a couple of headers and cpp files without installing the whole thing. – Hamish Grubijan Jul 02 '10 at 02:53
19

@Hamish: Feel free to take what you like, its all under CPL. If you don't have Boost, or just use vanilla C++ with STL you can just comment out #define ENABLE_LEXICAL_CAST #define ENABLE_RANDOM #define ENABLE_REGEX, and its should all work, its all explained in the readme.txt. – Jul 02 '10 at 03:43

Jerry Coffin · Answer 2 · 2010-07-02T20:17:22.163

In C++, it's probably easiest to use a stsringstream:

std::istringstream buffer("kas\nhjkfh kjsdjkasf");

std::vector<std::string> strings;

std::copy(std::istream_iterator<std::string>(buffer),
          std::istream_iterator<std::string>(),
          std::back_inserter(strings));

I haven't tried to stick to exactly the same signature, mostly because most of it is non-standard, so it doesn't apply to C++ in general.

Another possibility would be to use Boost::tokenizer, though obviously that does involve another library, so I won't try to cover it in more detail.

I'm not sure if that qualifies as "bizarro syntax" or not. I may have to work a bit on that part...

Edit: I've got it -- initialize the vector instead:

std::istringstream buffer("kas\nhjkfh kjsdjkasf");

std::vector<std::string> strings(
    (std::istream_iterator<std::string>(buffer)),
    std::istream_iterator<std::string>());

The "bizarro" part is that without the extra parentheses around the first argument, this would invoke the "most vexing parse", so it would declare a function instead of defining a vector. :-)

Edit2: As far as the edit to the question goes, it seems nearly impossible to answer directly -- it depends on too many types (e.g., CGXStyle, CLVDateTime) that are neither standard nor explained. I, for one, can't follow it in any detail at all. Offhand, this looks like a fairly poor design, letting the user enter things that are more or less ambiguous, and then trying to sort out the mess. Better to use a control that only allows unambiguous input to start with, and you can just read some fields that contain a date and time directly.

Edit3: code to do the splitting that also treats commas as separators could be done like this:

#include <iostream>
#include <locale>
#include <algorithm>
#include <vector>
#include <sstream>

class my_ctype : public std::ctype<char> {
public:
    mask const *get_table() { 
        // this copies the "classic" table used by <ctype.h>:
        static std::vector<std::ctype<char>::mask> 
            table(classic_table(), classic_table()+table_size);

        // Anything we want to separate tokens, we mark its spot in the table as 'space'.
        table[','] = (mask)space;

        // and return a pointer to the table:
        return &table[0];
    }
    my_ctype(size_t refs=0) : std::ctype<char>(get_table(), false, refs) { }
};

int main() { 
    // put our data in a strea:
    std::istringstream buffer("first kas\nhjkfh kjsdjk,asf\tlast");

    // Create a ctype object and tell the stream to use it for parsing tokens:
    my_ctype parser;
    buffer.imbue(std::locale(std::locale(), &parser));

    // separate the stream into tokens:
    std::vector<std::string> strings(
        (std::istream_iterator<std::string>(buffer)),
        std::istream_iterator<std::string>());

    // copy the tokes to cout so we can see what we got:
    std::copy(strings.begin(), strings.end(), 
        std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}

Cool, we use VS2010 to compile this, so `Boost` is a bit of a stretch, but I am sure that lots of libraries are available. — Hamish Grubijan, Jul 01 '10 at 23:01
Wait, Jerry, where do I specify the list of characters to tokenize on? See answer by Beh Tou Cheh above for example. he has: `strtk::parse(data, ", \r\n", str_list);`. — Hamish Grubijan, Jul 02 '10 at 18:20
They're specified in the `locale` used by the stream. By default it'll only be whitespace, but you can create a locale with a `ctype facet` that uses whatever you want. http://stackoverflow.com/questions/1894886/parsing-a-comma-delimited-stdstring/1895584#1895584 — Jerry Coffin, Jul 02 '10 at 18:38
Boost.Tokenzier is ideal for splitting up strings. The iterator interface is very easy to work with too for those familiar with the STL. — Dr. Watson, Jul 02 '10 at 20:22
@Jerry, I noted the edit, and I appreciate it. I am going to test this next week now. — Hamish Grubijan, Jul 02 '10 at 22:30
Jerry, does this program fully compile for you? I had this problem: `thefile.cpp(1404): error C2228: left of '.imbue' must have class/struct/union` as well as: `thefile.cpp(1412): error C2440: '' : cannot convert from 'std::istringstream (__cdecl *)(CStringA)' to 'std::istream_iterator<_Ty>'`. — Hamish Grubijan, Jul 07 '10 at 21:27
@Hamish: it does compile, though looking at it again, it really *should* also `#include ` (VC++ doesn't mind but, for one example, g++ requires it). For better or worse, the C++ standard allows one header to include another, allowing problems like this to slip through... — Jerry Coffin, Jul 08 '10 at 02:15

score 4 · Answer 3 · answered Jul 01 '10 at 22:24

4

The best way to do it would be using strtok. That link should be self explanatory on how to use it, and you can use multiple delimiters as well. Very handy C function.

answered Jul 01 '10 at 22:24

adam_0

6,920
6
40
52

1

+1, but I'm sure someone will have a crazy C++ solution including bizarro syntax to take away your upvotes. – Carl Norum Jul 01 '10 at 22:26
1

This is a good solution for C. If you want to use C++, there's nothing "crazy" about coding to the language's strengths. – Cogwheel Jul 01 '10 at 22:29
11

If `strtok` is the right answer, you've usually asked the wrong question. – Jerry Coffin Jul 01 '10 at 22:34
@Jerry, what do you mean? Should it be strntok instead? – Hamish Grubijan Jul 01 '10 at 22:48
@Hamish: Usually it should be (among other things) something that doesn't modify its input, and doesn't have quite such an error-prone interface. – Jerry Coffin Jul 01 '10 at 22:53
@Jerry, sorry, confused again ... so, `strtok` can modify its input and it has an error-prone interface? How so? Where can I read more on this? – Hamish Grubijan Jul 01 '10 at 23:04
@Hamish: notice how `strtok` does not say `const char *str`, this is because it modifies the input string. – SiegeX Jul 01 '10 at 23:16
@Hamish: More accurately, `strtok` *will* (at least normally) modify its input -- that's how it's defined to work, and part of what makes it error-prone (e.g., passing it a string literal leads to undefined behavior). The sequence to use it is also clumsy -- call once passing the input string, then repeatedly passing NULL, stopping only when it returns NULL. – Jerry Coffin Jul 01 '10 at 23:17
3

One small detail worth noting about strtok: it's not thread safe. I don't know if this applies to OP's question really, but still notable. – Tom Jul 01 '10 at 23:45
@Tom: that's really an implementation detail. The version on most POSIX systems isn't. OTOH, the one in the MS multi-threaded library is thread safe (though non-trivially by allocating thread local storage). – Jerry Coffin Jul 02 '10 at 01:03
1

There's usually nothing *wrong* with `strtok()` modifying its input - in most of the cases where you want to parse with `strtok()` you don't care about the unparsed string any more. The string literal complaint is also bogus; runtime parsing of a compile-time constant string is most definitely a corner case. – caf Jul 02 '10 at 01:44
I am more worried about thread-safety than mutation. I could always make a copy of a string to be mutated, but I do want the result to be correct. – Hamish Grubijan Jul 02 '10 at 02:49
@Hamish Grubijan: Most systems provide a `strtok()` alternative that can be used in thread-safe way - eg the re-entrant `strtok_r()`. – caf Jul 03 '10 at 10:45
Ah, nice ... so, can I do this in VS2010 without installing extra stuff? Do you mind posting a separate answer with an example of working code? – Hamish Grubijan Jul 03 '10 at 17:26
@Hamish Grubijan: `strtok()` is a standard function. I can provide example code of `strtok()` in a few hours when the code will be available to me. Is this what you want, or are you looking for `strtok_r()`? I've never done work with the latter, but I could check it out if that's what you need. Let me know if this would help and if you're still looking at the problem. – adam_0 Jul 19 '10 at 21:47
@adam_O If you do have a solution, then I could use it. I have not fixed the bug yet; other things had more priority. – Hamish Grubijan Jul 20 '10 at 01:13
I'm sorry I've been very busy but here's an example on the web. Hope this helps. http://www.elook.org/programming/c/strtok.html – adam_0 Jul 23 '10 at 03:53

score 1 · Answer 4 · answered Jul 01 '10 at 23:16

Quite an over the top way of sorting this problem is to use the Qt libraries. If you're using KDE then they're installed already. The QString class has a member function split which works like the python version. For example

QString("This is a string").split(" ", QString::SkipEmptyParts)

returns a QStringList of QStrings:

["This", "is", "a", "string"]

(in pythonic syntax). Note the second argument is required or else should the words be split by multiple spaces, each individual one would be returned.

In general I find with the help of the Qt libraries, most of the simplicity of python, eg. simple string parsing and list iteration, can be handled with ease and with the power of C++.

We are a MSFT shop, so I can use whatever standard libs come with VS2010. I like Linux, open source, etc, but I cannot just install arbitrary libs. I could steal a couple of .h and .cpp files though if a license allows it. — Hamish Grubijan, Jul 02 '10 at 02:51

score 0 · Answer 5 · answered Jul 01 '10 at 22:38

0

Parsing strings in C/C++ rarely turns out to be a simple matter. The method you posted looks like it has quite a bit of "history" involved in it. For example, you state that you want to split the string on white space. But the method itself appears to be using a member variable m_strDelim as part of the splitting decision. Simply replacing the method could lead to other unexpected issues.

Using an existing tokenizing class such as this Boost library could simplify things quite a bit.

answered Jul 01 '10 at 22:38

Mark Wilkins

40,729
5
57
110

4

"Parsing strings in C/C++ rarely turns out to be a simple matter." It is a simple matter, but solutions like parsing tend to involve processing strings character-by-character and loops. I.e. I can't think of something (standard) like python's split() function in C++.... – SigTerm Jul 01 '10 at 23:43

score 0 · Answer 6 · answered Jul 02 '10 at 02:31

0

You can use boost::algorithm::split. I.e.:

std::string myString;
std::vector<std::string> splitStrings;
boost::algorithm::split(splitStrings, myString, boost::is_any_of(" \r\n"));

answered Jul 02 '10 at 02:31

Billy ONeal

104,103
58
317
552

+1 The Boost String Algorithms library is essential for string operations! – Matthieu M. Jul 02 '10 at 06:47
I really wish I knew why people were downvoting this. – Billy ONeal Jul 02 '10 at 20:54
I did not downvote. I suppose it is because of my comment (not original post) that I prefer to use standard libs only ... – Hamish Grubijan Jul 02 '10 at 22:32
@Hamish Grubijan: Yes -- agreed -- I was going to delete my answer except I saw that the most upvoted answer to this question is asking to use another library.... – Billy ONeal Jul 03 '10 at 01:49
6

@Billy: I've +1 in the hopes you don't break out into tears :D – Jul 03 '10 at 23:03
@Beh: Thank you :) I don't have a major issue with the downvotes -- just if someone is going to downvote the answer they should at least leave a comment as to why there's a problem with the answer. – Billy ONeal Jul 04 '10 at 02:16

score 0 · Answer 7 · answered Jul 27 '10 at 17:07

A better method than my other answer: TR1's regex feature. Here's a small tutorial to get you started. This answer is C++, uses regular expressions (which is perhaps the best / easiest way to split a string), and I used it myself recently, so I know it's a nice tool.

A better way to split a string into an array of strings in C/C++ using whitespace as a delimiter

7 Answers7

Linked