
I want to safely read a line from an std::istream. The stream could be anything, e.g., a connection on a Web server or something processing files submitted by unknown sources. There are many answers which start with the moral equivalent of this code:

void read(std::istream& in) {
    std::string line;
    if (std::getline(in, line)) {
        // process the line
    }
}

Given the possibly dubious source of in, using the above code would lead to a vulnerability: a malicious agent may mount a denial of service attack against this code using a huge line. Thus, I would like to limit the line length to some rather high value, say 4 million chars. While a few large lines may be encountered, it isn't viable to allocate a buffer of the maximum size up front and use std::istream::getline().

How can the maximum size of the line be limited, ideally without distorting the code too badly and without allocating large chunks of memory up front?

Dietmar Kühl
  • What about a custom allocator that will throw if asked to allocate beyond your threshold? Construct a `basic_string` object using that allocator and read into it. – Praetorian Dec 17 '13 at 22:00
  • Maybe subclassing std::string and providing a `max_size()` function that spits out something small? – Collin Dec 17 '13 at 22:07
  • @Praetorian: I guess, using an allocator would be an option. Sadly, it changes the type of `std::string`. – Dietmar Kühl Dec 17 '13 at 22:09
  • You could replace the streambuf of `in` with your own implementation that wraps the original streambuf and sends a `'\n'` when a certain number of characters have been read. – jrok Dec 17 '13 at 22:39
  • @DietmarKühl: Maybe you can try to simply check the number of chars in the buffer before extracting: `if (in.rdbuf()->in_avail() > max_size) { /* end */ }`... – mb84 Dec 19 '13 at 18:40
  • @mb84: `in_avail()` shows how many characters are known to be available, not how many there will be available in total. Basically, it tells how many characters are currently in the stream buffer's buffer. Once these are read, the stream buffer will provide another buffer of characters, possibly blocking until more characters arrive or it becomes apparent that the source has gone stale. – Dietmar Kühl Dec 19 '13 at 18:44
  • I wonder whether you should apply security measures at the level of `getline`, or higher up in your application. E.g. if you are reading from a socket (e.g. through `boost::asio::ip::tcp::iostream`), limiting the buffer size may guard against a single huge line. But a malicious agent may also mount a DoS attack by sending a huge number of small lines. Does your code also aim to guard against that threat? Or should it somehow be guarded against in your firewall, outside your application? – TemplateRex Dec 26 '13 at 09:57
  • @TemplateRex: although clearly the total amount of data needs to be controlled, too, there is a way to break out of a loop reading lines after a suitable number of lines. The issue with functions like `std::getline()` is that there is no obvious way to control when they should stop accepting input. That said, from the approaches suggested the approach limiting the amount of data forwarded by a stream buffer could be used easily to limit the total amount of data (in fact, it is easier and more efficient to limit the total amount of data). – Dietmar Kühl Dec 26 '13 at 15:55
  • `a malicious agent may **mount** a denial of service attack against this code using a huge line` -> That sounded soooo awesome but since I'm a C++ noobie, can you please tell how could one do that? :) – Angelus Mortis Feb 28 '16 at 17:02

4 Answers


You could write your own version of std::getline that takes a maximum number of characters to read as a parameter, called something like getline_n.

#include <string>
#include <iostream>

template<typename CharT, typename Traits, typename Alloc>
auto getline_n(std::basic_istream<CharT, Traits>& in, std::basic_string<CharT, Traits, Alloc>& str, std::streamsize n) -> decltype(in) {
    std::ios_base::iostate state = std::ios_base::goodbit;
    bool extracted = false;
    const typename std::basic_istream<CharT, Traits>::sentry s(in, true);
    if(s) {
        try {
            str.erase();
            typename Traits::int_type ch = in.rdbuf()->sgetc();
            for(; ; ch = in.rdbuf()->snextc()) {
                if(Traits::eq_int_type(ch, Traits::eof())) {
                    // eof spotted, quit
                    state |= std::ios_base::eofbit;
                    break;
                }
                else if(Traits::eq_int_type(ch, Traits::to_int_type(in.widen('\n')))) {
                    // newline spotted, discard it and quit
                    extracted = true;
                    in.rdbuf()->sbumpc();
                    break;
                }
                else if(str.size() == n) {
                    // maximum number of characters met, quit without
                    // consuming the character that did not fit
                    extracted = true;
                    break;
                }
                else if(str.max_size() <= str.size()) {
                    // string too big
                    state |= std::ios_base::failbit;
                    break;
                }
                else {
                    // character valid
                    str += Traits::to_char_type(ch);
                    extracted = true;
                }
            }
        }
        catch(...) {
            in.setstate(std::ios_base::badbit);
        }
    }

    if(!extracted) {
        state |= std::ios_base::failbit;
    }

    in.setstate(state);
    return in;
}

int main() {
    std::string s;
    getline_n(std::cin, s, 10); // maximum of 10 characters
    std::cout << s << '\n';
}

Might be overkill though.

Rapptz
  • Writing a version of `getline()` could be an option (especially as I have actually done so in the past, having implemented all of the IOStreams library). I don't know why it didn't occur to me: maybe I was too focused on two other solutions (only one of which has been mentioned so far). – Dietmar Kühl Dec 17 '13 at 22:21
  • +1. The only thing I question is the call to `reserve`, since the OP is talking about something on the order of 4 MB as a guard, but might only be dealing with much smaller strings. It might be better to have the user perform the reserve on their own. – Dave S Dec 17 '13 at 22:25
  • @DaveS Funnily enough, the original version that I wrote didn't have the `reserve` call but I added it in for good measure. If it were up to me I wouldn't have done it either. I figure I'll just remove it. – Rapptz Dec 17 '13 at 22:26
  • I'm confused - where does this code actually check for the new line character? – Siler Sep 18 '18 at 17:41

There is already such a getline function as a member function of std::istream; you just need to wrap it for buffer management.

#include <assert.h>
#include <istream>
#include <stddef.h>         // ptrdiff_t
#include <string>           // std::string, std::char_traits

typedef ptrdiff_t Size;

namespace my {
    using std::istream;
    using std::string;
    using std::char_traits;

    istream& getline(
        istream& stream, string& s, Size const buf_size, char const delimiter = '\n'
        )
    {
        s.resize( buf_size );  assert( s.size() > 1 );
        stream.getline( &s[0], buf_size, delimiter );
        if( !stream.fail() )
        {
            Size const n = char_traits<char>::length( &s[0] );
            s.resize( n );      // Downsizing.
        }
        return stream;
    }
}  // namespace my
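
For illustration, a minimal usage sketch (the file name and the 4096-character cap are arbitrary):

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream   in( "input.txt" );      // hypothetical input file
    std::string     line;

    // Each call caps the line at 4096 characters; on an overlong line the
    // wrapped istream::getline sets failbit, so the loop stops instead of
    // consuming unbounded memory.
    while( my::getline( in, line, 4096 ) )
    {
        std::cout << line.size() << " characters read\n";
    }
}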
Cheers and hth. - Alf
  • `&s[0]` makes me feel uneasy – Inverse Dec 20 '13 at 00:16
  • @Inverse: There's nothing to feel uneasy about. You might as well feel uneasy about a division, reasoning that it could turn out to be a division by zero. In this code the relevant constraint on the value (that the string length must be >0) is expressed by an `assert`, which is generally good practice and makes the code safer than without it. With the `assert` one must work hard in order to bring forth the nasal daemons of UB: namely, call the function with an invalid argument for `buf_size` *and* do this with `NDEBUG` defined so as to suppress the `assert`. That's why you should use `assert`. – Cheers and hth. - Alf Dec 20 '13 at 06:53
  • Oh I meant that my understanding is the data in a `std::string` is not guaranteed to be contiguous; it's only `.c_str()` that is. So `&v[0]` is fine for a `std::vector` but not strictly for `std::string` – Inverse Jan 17 '14 at 15:54
  • @Inverse: `std::string` buffer has been guaranteed contiguous since the Lillehammer meeting in 2005. Of course it wasn't official until C++11, but no compiler vendor would be able to sell a compiler where they introduced the opposite, given both existing code dependencies and known wording in the future C++11 standard. At the time one way to identify pretend-language-lawyers was discussion about whether this could be relied on. ;-) – Cheers and hth. - Alf Jan 18 '14 at 02:42
  • Is copy-on-write std::string an issue? (There certainly were such implementations around in the early 2000's, although C++11 outlaws them) – M.M Dec 18 '14 at 03:23
  • @MattMcNabb: No, COW is not an issue here. It has to be compatible with any ordinary single-threaded use of strings, and so it has to perform any deferred copying when `operator[]` is used (or any reference to the buffer is obtained). At least one of g++'s standard library implementations used COW strings, and I'd be surprised if that has changed even if C++11 really prohibits that. But I'd need to see the wording to be convinced that they're really invalid now. Without knowing anything about the wording, I would think that at most thread safety for reads is required, i.e. an efficiency issue. – Cheers and hth. - Alf Dec 18 '14 at 03:50
  • The SO thread on C++11 COW is [here](http://stackoverflow.com/questions/12199710/legality-of-cow-stdstring-implementation-in-c11) – M.M Dec 18 '14 at 03:51
  • @MattMcNabb: Thanks! Well the selected answer doesn't scan, really. Faulty logic, because the COW copying is done the first time an external reference is obtained (for the given buffer allocation), so it doesn't (it can't logically) invalidate any existing references. But a faulty argument doesn't make the position argued for untrue; it may still be that C++ prohibits COW. Possibly. – Cheers and hth. - Alf Dec 18 '14 at 03:55
  • @MattMcNabb: [Luc Danton's answer](http://stackoverflow.com/a/12199893/464581) explains how COW is at least partially prohibited in C++11, by a requirement that the copy constructor makes a buffer copy. – Cheers and hth. - Alf Dec 18 '14 at 04:11
  • @MattMcNabb: Sorry, I'm evidently tired. Luc's answer doesn't show that COW is prohibited in any way. Merely that a call of data() must trigger COW copying if the buffer is currently shared. Hm. – Cheers and hth. - Alf Dec 18 '14 at 04:18

Replace std::getline by creating a wrapper around std::istream::getline:

std::istream& my::getline( std::istream& is, std::streamsize n, std::string& str, char delim )
    {
    try
       {
       str.resize(n);
       is.getline(&str[0],n,delim);
       str.resize(is.gcount());
       return is;
       }
    catch(...) { str.resize(0); throw; }
    }

If you want to avoid excessive temporary memory allocations, you could use a loop which grows the allocation as needed (probably doubling in size on each pass). Don't forget that exceptions might or might not be enabled on the istream object.

Here's a version with a more efficient allocation strategy:

std::istream& my::getline( std::istream& is, std::streamsize n, std::string& str, char delim )
    {
    std::streamsize base=0;
    do {
       try
          {
          is.clear();
          std::streamsize chunk=std::min(n-base,std::max(static_cast<std::streamsize>(2),base));
          if ( chunk == 0 ) break;
          str.resize(base+chunk);
          is.getline(&str[base],chunk,delim);
          }
       catch( std::ios_base::failure const& ) { if ( !is.gcount() ) { str.resize(0); throw; } }
       base += is.gcount();
       } while ( is.fail() && is.gcount() );
    str.resize(base);
    return is;
    }
Brent Bradburn
  • As implemented, both of these will leave a terminating '\0' character in the generated string. This is not normal for a C++ string, so an improvement would be to pop the final character before returning. Note that resizing based on scanning the string for '\0' might be considered buggy since '\0' could be a valid character within the string (this isn't 'C'). Also, I don't know exactly how this interacts with Microsoft 'text' mode where text lines are typically terminated by *two* characters. If I understand the docs, a '\r' would be left in the string because is.getline() is "unformatted". – Brent Bradburn Feb 17 '14 at 04:54
  • ... `if ( !str.empty() ) str.resize(str.size()-1);` – Brent Bradburn Feb 17 '14 at 05:00
  • `try { my::getline( is, 4096, s, '\n' ); } catch ( std::ios::failure const & ) {}` – Brent Bradburn Apr 21 '16 at 17:19

Based on the comments and answers, there seem to be three approaches:

  1. Write a custom version of getline() possibly using the std::istream::getline() member internally to get the actual characters.
  2. Use a filtering stream buffer to limit the amount of data potentially received.
  3. Instead of reading into a std::string, use a string instantiation with a custom allocator limiting the amount of memory the string can hold.

Not all of the suggestions came with code. This answer provides code and a bit of discussion for all three approaches. Before going into implementation details, it is worth pointing out that there are multiple choices of what should happen if an excessively long input is received:

  1. Reading an overlong line could result in a successful read of a partial line, i.e., the resulting string contains the read content and the stream doesn't have any error flags set. Doing so means, however, that it isn't possible to distinguish between a line hitting the limit exactly and one being too long. Since the limit is somewhat arbitrary anyway, it probably doesn't really matter, though.
  2. Reading an overlong line could be considered a failure (i.e., setting std::ios_base::failbit and/or std::ios_base::badbit) and, since reading failed, yield an empty string. Yielding an empty string, however, makes it impossible to inspect the partial input to see what is going on.
  3. Reading an overlong line could provide the partial line read and also set error flags on the stream. This seems reasonable behavior: it both detects that something is up and provides the input for potential inspection.

Although there are multiple code examples implementing a limited version of getline() already, here is another one! I think it is simpler (albeit possibly slower; performance can be dealt with when necessary) and it also retains std::getline()'s interface: it uses the stream's width() to communicate a limit (maybe taking width() into account would be a reasonable extension to std::getline()):

#include <iomanip>
#include <istream>
#include <iterator>
#include <limits>
#include <string>

template <typename cT, typename Traits, typename Alloc>
std::basic_istream<cT, Traits>&
safe_getline(std::basic_istream<cT, Traits>& in,
             std::basic_string<cT, Traits, Alloc>& value,
             cT delim = cT('\n'))
{
    typedef std::basic_string<cT, Traits, Alloc> string_type;
    typedef typename string_type::size_type size_type;

    // Do not skip leading whitespace, matching std::getline()'s behavior.
    typename std::basic_istream<cT, Traits>::sentry cerberos(in, true);
    if (cerberos) {
        value.clear();
        size_type width(in.width(0));
        if (width == 0) {
            width = std::numeric_limits<size_type>::max();
        }
        std::istreambuf_iterator<cT, Traits> it(in), end;
        for (; value.size() != width && it != end; ++it) {
            if (!Traits::eq(delim, *it)) {
                value.push_back(*it);
            }
            else {
                ++it;
                break;
            }
        }
        if (value.size() == width) {
            in.setstate(std::ios_base::failbit);
        }
    }
    return in;
}

This version of getline() is used just like std::getline(); when it seems reasonable to limit the amount of data read, the stream's width() is set, e.g.:

std::string line;
if (safe_getline(in >> std::setw(max_characters), line)) {
    // do something with the input
}

Another approach is to use a filtering stream buffer to limit the amount of input: the filter simply counts the number of characters processed and limits the amount to a suitable number of characters. This approach is more easily applied to an entire stream than to an individual line: when processing just one line, the filter can't obtain buffers full of characters from the underlying stream because there is no reliable way to put the characters back. Implementing an unbuffered version is still simple but probably not particularly efficient:

template <typename cT, typename Traits = std::char_traits<cT> >
class basic_limitbuf
    : std::basic_streambuf<cT, Traits> {
public:
    typedef Traits                    traits_type;
    typedef typename Traits::int_type int_type;

private:
    std::streamsize                   size;
    std::streamsize                   max;
    std::basic_istream<cT, Traits>*   stream;
    std::basic_streambuf<cT, Traits>* sbuf;

    int_type underflow() {
        if (this->size < this->max) {
            return this->sbuf->sgetc();
        }
        else {
            this->stream->setstate(std::ios_base::failbit);
            return traits_type::eof();
        }
    }
    int_type uflow()     {
        if (this->size < this->max) {
            ++this->size;
            return this->sbuf->sbumpc();
        }
        else {
            this->stream->setstate(std::ios_base::failbit);
            return traits_type::eof();
        }
    }
public:
    basic_limitbuf(std::streamsize max,
                   std::basic_istream<cT, Traits>& stream)
        : size()
        , max(max)
        , stream(&stream)
        , sbuf(this->stream->rdbuf(this)) {
    }
    ~basic_limitbuf() {
        std::ios_base::iostate state = this->stream->rdstate();
        this->stream->rdbuf(this->sbuf);
        this->stream->setstate(state);
    }
};

This stream buffer is already set up to insert itself upon construction and remove itself upon destruction. That is, it can be used simply like this:

std::string line;
basic_limitbuf<char> sbuf(max_characters, in);
if (std::getline(in, line)) {
    // do something with the input
}

It would be easy to add a manipulator setting up the limit, too. One advantage of this approach is that none of the reading code needs to be touched if the total size of the stream can be limited: the filter can be set up right after creating the stream. When there is no need to back out the filter, the filter could also use a buffer, which would greatly improve the performance.
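
Here is a rough sketch of capping an entire stream this way (the file name and the total_limit value are arbitrary placeholders):

#include <fstream>
#include <string>

void process_upload(std::streamsize total_limit) {
    std::ifstream file("upload.txt");              // hypothetical input source
    basic_limitbuf<char> guard(total_limit, file); // caps everything read from file
    std::string line;
    while (std::getline(file, line)) {
        // existing line-by-line processing, unchanged
    }
    // guard's destructor restores the original stream buffer
}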

The third approach suggested is to use a std::basic_string with a custom allocator. There are two aspects which are a bit awkward about the allocator approach:

  1. The string being read actually has a type which isn't immediately convertible to std::string (although it also isn't hard to do the conversion).
  2. The maximum array size can easily be limited, but the string will end up with some more or less random size smaller than that: when an allocation fails, an exception is thrown and there is no attempt to grow the string by a smaller amount.

Here is the necessary code for an allocator limiting the allocated size:

template <typename T>
struct limit_alloc
{
private:
    std::size_t max_;
public:
    typedef T value_type;
    limit_alloc(std::size_t max): max_(max) {}
    template <typename S>
    limit_alloc(limit_alloc<S> const& other): max_(other.max()) {}
    std::size_t max() const { return this->max_; }
    T* allocate(std::size_t size) {
        return size <= max_
            ? static_cast<T*>(operator new[](size * sizeof(T)))
            : throw std::bad_alloc();
    }
    void  deallocate(void* ptr, std::size_t) {
        operator delete[](ptr);
    }
};

template <typename T0, typename T1>
bool operator== (limit_alloc<T0> const& a0, limit_alloc<T1> const& a1) {
    return a0.max() == a1.max();
}
template <typename T0, typename T1>
bool operator!= (limit_alloc<T0> const& a0, limit_alloc<T1> const& a1) {
    return !(a0 == a1);
}

The allocator would be used something like this (the code compiles OK with a recent version of clang but not with gcc):

std::basic_string<char, std::char_traits<char>, limit_alloc<char> >
    tmp(limit_alloc<char>(max_chars));
if (std::getline(in, tmp)) {
    std::string line(tmp.begin(), tmp.end());
    // do something with the input
}

In summary, there are multiple approaches, each with its own small drawback, but each reasonably viable for the stated goal of limiting denial of service attacks based on overlong lines:

  1. Using a custom version of getline() means the reading code needs to be changed.
  2. Using a custom stream buffer is slow unless the entire stream's size can be limited.
  3. Using a custom allocator gives less control and requires some changes to reading code.
Dietmar Kühl