33

To read lines from a file there are the POSIX getline() and the standard fgets() functions (ignoring the dreaded gets()). The common advice is that getline() is preferred over fgets() because it allocates the line buffer as needed.

My question is: Isn’t that dangerous? What if by accident or malicious intent someone creates a 100GB file with no '\n' byte in it – won’t that make my getline() call allocate an insane amount of memory?

edavid
  • Doesn't that depend on implementation? – MartinVeronneau May 03 '19 at 13:03
  • 1
    Sure. My fear is that my system might then be swapping to the point of stalling. With `fgets()` this wouldn’t be possible. I wonder if I’m misunderstanding something or if this risk of `getline()` is known. – edavid May 03 '19 at 13:06
  • @xing only error code I see mentioned in the man page is `EINVAL`. – Christian Gibbons May 03 '19 at 13:07
  • 4
    `an insane amount of memory` - is 100GB an insane amount of memory? is 100KB? is 1PB? Tell that someone 20 years from now.... What amount of memory is insane and what isn't? `getline()` is a function from the 1990s. It exists to "make life easier", not to "handle all the crazy cases a user wants". Writing a `getline()` implementation with a max limit is not that hard. – KamilCuk May 03 '19 at 14:41
  • 1
    @KamilCuk Yes, no, and yes, respectively. In 20 years, the answers will be no, no, and yes. The exact threshold for what constitutes an insane amount of memory will certainly increase in later years, but the underlying problem (that there exist lines that will cause an out of memory condition or thrashing) will exist on any finite memory system. – Ray May 03 '19 at 14:50
  • 3
    @KamilCuk: The notion of a file containing a 100GB record without a length prefix seems a bit insane, no matter how much memory the target machine has. – supercat May 03 '19 at 15:42
  • @supercat: I wouldn't say it's insane, actually. What seems insane is not using fread if you're expecting that kind of file, or not checking getline's error return if you're not using a previously-allocated buffer. – jamesqf May 03 '19 at 17:45
  • @xing Interesting. My man page is from the 3.53 linux man page release. Is yours newer? – Christian Gibbons May 03 '19 at 17:55
  • `an insane amount of memory` – an order of magnitude more than the size of the RAM, on any machine from a Zuse II to HAL 9000. – edavid May 06 '19 at 08:47
  • @jamesqf, even if you pass a previously allocated memory buffer to `getline()`, it will expand it with no limit using `realloc()`. – edavid May 06 '19 at 08:49

6 Answers

28

My question is: Isn’t that dangerous? What if by accident or malicious intent someone creates a 100GB file with no '\n' byte in it – won’t that make my getline() call allocate an insane amount of memory?

Yes, what you describe is a plausible risk. However,

  • if the program requires loading an entire line into memory at once, then allowing getline() to attempt to do that is not inherently more risky than writing your own code to do it with fgets(); and
  • if you have a program that has such a vulnerability, then you can mitigate the risk by using setrlimit() to limit the total amount of (virtual) memory it can reserve. This can be used to cause it to fail instead of successfully allocating enough memory to interfere with the rest of the system.
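
For illustration, here is a minimal sketch of that setrlimit() mitigation, assuming a Linux/glibc system; the 512 MiB cap is an arbitrary example value, not a recommendation:

/* Sketch: cap the process's virtual address space so a runaway
 * getline() allocation fails with ENOMEM instead of exhausting
 * the machine. The 512 MiB limit is only an example. */
#define _GNU_SOURCE           /* for getline() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit lim = { .rlim_cur = 512UL * 1024 * 1024,
                          .rlim_max = 512UL * 1024 * 1024 };
    if (setrlimit(RLIMIT_AS, &lim) != 0) {
        perror("setrlimit");
        return EXIT_FAILURE;
    }

    char *line = NULL;
    size_t n = 0;

    /* On a pathological input this now fails (-1, errno == ENOMEM)
     * rather than swapping the whole system to a halt. */
    if (getline(&line, &n, stdin) == -1)
        perror("getline");

    free(line);
    return 0;
}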

Best overall, I'd argue, is to write code that does not require input in units of full lines (all at once) in the first place, but such an approach has its own complexities.

John Bollinger
  • setrlimit might work, but there are pitfalls; it can't limit resident memory (RSS) on modern kernels, and virtual memory (VSZ) can and often does safely exceed physical RAM size well before it will start thrashing, so limiting that may result in allocations failing when you'd rather they not. (https://stackoverflow.com/questions/39755928/how-do-i-use-setrlimit-to-limit-memory-usage-rlimit-as-kills-too-soon-rlimit), (https://superuser.com/questions/618687/why-do-programs-on-linux-kernel-use-so-much-more-vmem-than-resident-memory) – Ray May 03 '19 at 15:06
  • @Ray, as for `setrlimit()`, I specifically suggested limiting the amount of virtual memory allowed to the process, not its RSS. The question of what limit to choose is tangential and not addressed. – John Bollinger May 03 '19 at 15:20
  • Fair point on fgets. I've removed that objection. But the setrlimit issue is still there even when limiting virtual memory; RSS is what we actually *want* to limit to avoid thrashing, and VSZ is a *loose* upper bound of that. For example, I've got a copy of gvim running that is using 5 MB of actual memory, but 550 MB of VSZ. Using setrlimit to limit virtual memory to the actual RAM size will result in allocations failing well before they need to. I agree that setrlimit may be useful here, but that approach has issues that the OP needs to be aware of before it'll be useful. – Ray May 03 '19 at 15:31
  • Fair enough, @Ray, but I contend that if it makes sense at all for the program to require entire lines to be loaded into memory at once then it is reasonable for the program to set a VSZ limit *much* smaller than needed to avoid that operation thrashing. The OP does postulate that the input he describes is malformed, after all. If the program needs to process huge lines, then requiring them to be held in memory in their entirety is the issue. – John Bollinger May 03 '19 at 16:01
14

It can be dangerous, yes. Don’t know how this would work on other computers, but running the code below froze my computer to the point of needing a hard reset:

/* DANGEROUS CODE */

#define _GNU_SOURCE     /* for getline() on glibc */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f;
    char *s = NULL;     /* must start NULL (or malloc'd) for getline() */
    size_t n = 0;

    f = fopen("/dev/zero", "r");
    if (f == NULL)
        return EXIT_FAILURE;

    /* /dev/zero never produces '\n', so getline() keeps growing the buffer */
    getline(&s, &n, f);

    free(s);
    fclose(f);
    return 0;
}
Tom Zych
  • Actually, after a while the output is `$ Killed`. No OS freeze. That's not fun :(. I guess Ubuntu has some defence against processes consuming too much memory now :( – milanHrabos Nov 17 '22 at 21:46
3

Some coding guidelines (like MISRA C) may prevent you from using dynamic memory allocation (and therefore getline()). There are reasons for that, for example avoiding memory leaks.

If you know the maximum size of all acceptable lines, then you can avoid memory allocation by using fgets() instead of getline(), and so remove one potential source of memory leaks.

SKi
2

The getline function uses malloc and realloc internally and returns -1 if they fail, so the result is no different than if you attempted to call malloc(100000000000). Namely, errno gets set to ENOMEM and getline returns -1.

So you would have the same problem whether you used getline or tried to do the same thing with fgets and manual memory allocation to ensure you read a full line.
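
As a sketch of checking that failure path (reading from stdin here is just an example, not part of the answer):

#define _GNU_SOURCE   /* for getline() on glibc */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *line = NULL;
    size_t cap = 0;

    errno = 0;
    if (getline(&line, &cap, stdin) == -1) {
        if (errno == ENOMEM)
            fprintf(stderr, "out of memory while reading a line\n");
        /* errno still 0 here just means end of file */
    }

    free(line);   /* the buffer may have been allocated even on failure */
    return 0;
}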

dbush
  • 7
    Except with `fgets()` and manual allocation you can choose to bail-out long before your process's whole memory is exhausted. – TripeHound May 03 '19 at 13:47
  • 6
    Actually the behavior is very different from just calling `malloc(100000000000)` because `getline()` will first allocate a small size and reallocate progressively all the way to the point where it finds the newline, the end of file or the memory limit. This process will cause tremendous stress on the system, potentially to the point where it becomes unusable, whereas `malloc(100000000000)` might just return `NULL` immediately with no consequences. – chqrlie May 05 '19 at 07:07
1

Really it depends how you want to handle lines that are too long.

fgets with a decent-sized buffer will generally work, and you can detect that it has "failed": the end of the buffer contains no newline character. It is possible to avoid calling strlen() every time just to check whether the buffer was filled, but that is a different question.

Perhaps your strategy is simply to skip lines that can't be processed, or perhaps the rest of the line is just a comment you would ignore anyway. In that case it is easy to put fgets in a loop to discard the rest of the line with no allocation penalty, as sketched below.
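
A rough sketch of that bounded-read-and-discard strategy (read_line_bounded is a made-up helper name, not an existing API):

#include <stdio.h>
#include <string.h>

/* Sketch: read at most size-1 bytes of a line; if the line is longer,
 * truncate it and discard the remainder so the next read starts on a
 * fresh line. Returns 1 on success, 0 on EOF or read error. */
static int read_line_bounded(FILE *f, char *buf, size_t size)
{
    if (fgets(buf, (int)size, f) == NULL)
        return 0;

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';              /* complete line: strip newline */
        return 1;
    }

    /* No newline: over-long line (or last line of the file). Discard
     * the rest of it, one character at a time, with no allocation. */
    int c;
    while ((c = fgetc(f)) != EOF && c != '\n')
        ;
    return 1;
}

A caller would loop on it with something like char buf[256]; while (read_line_bounded(f, buf, sizeof buf)) { ... }.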

If you do want to read the whole line regardless then getline may be the better strategy for you. The malicious user would need a lot of disk space to cause the bad behaviour you describe, or perhaps pass /dev/random or similar as the input filename.

Again, if getline can't realloc, it will fail in a way you can recover from. But if you are reusing the buffer across multiple line reads, you might want to free it after an error before reading more: it is still allocated, and it may have grown as large as it could get before failing.

Gem Taylor
  • I now decided to set a line length limit of 10,000 and use `fgets()`. Limiting the line length is perfectly fine in my use case. I asked the original question because in the [glibc documentation for `fgets()`](https://www.gnu.org/software/libc/manual/html_node/Line-Input.html#index-fgets) there is a warning about a flaw of `fgets()`, the caller can’t tell if the input data contain NUL bytes, recommending `getline()` instead. However, for `getline()`, which could bring the system down by uncontrolled memory allocation, there is no warning. – edavid May 06 '19 at 09:17
  • Hmm... a lot of the c string API is defeated if your string contains null characters, but if it does, then it is not a text file, and so you shouldn't be using line-based processing at all. – Gem Taylor May 07 '19 at 09:51
  • In terms of the malicious potential of injecting null bytes, your code will see a shortened line and perhaps (depending how you detect it) then treat the next line as a line continuation. I can't immediately see how that would be worse than any other slightly dodgy text file input. – Gem Taylor May 07 '19 at 09:56
  • 1
    Exactly. However, the glibc documentation says about fgets: “Don’t use it to read files edited by the user because, if the user inserts a null character, you should either handle it properly or print a clear error message. We recommend using getline instead of fgets.” – I am surprised they care about this nuisance a lot but apparently not about the hazard of unlimited memory allocation of getline. – edavid May 08 '19 at 15:58
  • Hmm... do they suggest a way to handle it properly, perchance? I'm down to `strlen` vs `ftell` delta mismatch? `ftell` is an ok speed call, but it does mean you had to call `strlen` Actually using `ftell` is not a bad strategy in general, but you do need to deal with windows/unix text files. – Gem Taylor May 08 '19 at 16:25
  • They suggest using `getline` so that one can scan the line for NUL bytes. With `fgets` one needs to call `strlen` anyway, it inserts a NUL byte at the end of the line – which is why, encountering NUL, one can’t tell whether it was in the file or has been inserted by `fgets`. The `ftell` method would work with seekable files but not pipe-like streams such as `stdin`. Alternatively, one could fill the line buffer with known bytes before passing it to `fgets`. If the bytes following a NUL are different from the known byte sequence, the NUL is most likely from the file. – edavid May 09 '19 at 10:05
  • In the end, if you are going to be this paranoid, it isn't very difficult to set up your own fixed buffer, fread it full, scan for newlines with memchr, memcpy down the unused portion and fread again! Perhaps I'll write that up sometime. – Gem Taylor May 09 '19 at 10:43
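
For what it's worth, a rough sketch of that fread()/memchr() approach (scan_lines, process_line and the buffer size are invented for illustration); because it works with byte lengths rather than NUL-terminated strings, embedded NUL bytes do not confuse it:

#include <stdio.h>
#include <string.h>

#define BUF_SIZE 4096   /* arbitrary example size */

/* Placeholder for real per-line processing; len excludes the newline. */
static void process_line(const char *line, size_t len)
{
    fwrite(line, 1, len, stdout);
    fputc('\n', stdout);
}

static void scan_lines(FILE *f)
{
    char buf[BUF_SIZE];
    size_t used = 0;

    for (;;) {
        size_t got = fread(buf + used, 1, sizeof buf - used, f);
        used += got;
        if (used == 0)
            break;                        /* nothing left to process */

        size_t start = 0;
        char *nl;
        while ((nl = memchr(buf + start, '\n', used - start)) != NULL) {
            process_line(buf + start, (size_t)(nl - buf) - start);
            start = (size_t)(nl - buf) + 1;
        }

        if (got == 0) {                   /* EOF: flush any final partial line */
            if (used > start)
                process_line(buf + start, used - start);
            break;
        }

        if (start == 0 && used == sizeof buf) {
            /* A line longer than the whole buffer: hand it over in chunks
             * (or report an error, depending on policy). */
            process_line(buf, used);
            used = 0;
        } else {
            memmove(buf, buf + start, used - start);   /* keep the leftover tail */
            used -= start;
        }
    }
}

A caller would simply do scan_lines(stdin), or pass any opened FILE *.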
0

getline() reallocates the buffer for you, which eases the memory management in your program a little.

But indeed, it can lead to a large chunk of memory being allocated. If that is a concern, then you should take extra steps and use functions that do not allocate memory implicitly.

MartinVeronneau