
Here is my approach:

```c
int linesize = 1;
int ReadStatus;
char buff[200];
ReadStatus = read(file, buff, linesize);
while (buff[linesize-1] != '\n' && ReadStatus != 0)
{
    linesize++;
    ReadStatus = read(file, buff, linesize);
}
```

Is this idea right?

I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.

What do you think?

....

I just saw a new problem. How do we read the second line? Is there any way to delimit the bytes?

daniel
  • 1) `read` doesn't read lines, it just reads bytes. There's no reason `buff` should end with `\n`. 2) For 200 bytes, I would argue that asymptotic complexity hardly matters. The time it takes a `read` syscall to complete isn't constant, but depends on the arguments - therefore you may have `O(log(FileSize))` syscalls, but still `O(FileSize)` performance - you can't read a file without actually reading it in its entirety. But again, 200 bytes are nothing. Disks usually work on a 512-byte basis and file system caches and even CPU/memory caches are a lot larger than that. Maybe of interest: `mmap` – Siguza Apr 08 '17 at 01:28
  • Every character written in the file is 1 byte. Is that correct? – daniel Apr 08 '17 at 01:30
  • I am trying to track the linefeed character to immediately buffer the line. – daniel Apr 08 '17 at 01:31
  • @daniel: depends on the encoding. – Karoly Horvath Apr 08 '17 at 01:35
  • I understand that it reads bytes but all the data is copied into the buffer as characters and as I understand every character is one byte. – daniel Apr 08 '17 at 01:38
  • @daniel Depends on multiple factors. First, what humans consider a character may or may not have the size of a C character (see: multibyte characters and `wchar_t`). And secondly, a `char` in C is not guaranteed to have 8 bits - it usually has, but the only thing guaranteed by C in that regard is that `sizeof(char) == 1` (see also `CHAR_BIT`). If your operating system is at all sensible, a comparison to a char literal should be successful though. – Siguza Apr 08 '17 at 01:43
  • "And secondly, a char in C is not guaranteed to have 8 bits - it usually has, but the only thing guaranteed by C in that regard is that sizeof(char) == 1", so in conclusion the size of a character is 1 byte and you are telling me I have to be careful with that fact. – daniel Apr 08 '17 at 01:45
  • @daniel No. It **usually** is 8 bits (like, on most implementations), but it is **in no way guaranteed** to be. For example, I once did some C programming for a TI-89 graphic calculator where a `char` (and `short` and `int` for that matter) were all 16 bits wide if I remember correctly. – Siguza Apr 08 '17 at 01:48
  • @Siguza am doing this for a cat program I am currently coding. Do you think it will be a problem? – daniel Apr 08 '17 at 01:52
  • Probably not, as long as you don't explicitly want to deal with multibyte characters. – Siguza Apr 08 '17 at 01:56
  • @Siguza Maybe an easier way to do this would be to store the whole file in a buffer and then handle the buffer separately to handle every line. Is that right? – daniel Apr 08 '17 at 02:13
  • @daniel Yes. Again, `mmap()` may also be of interest to you. ;) – Siguza Apr 08 '17 at 11:24
  • @Siguza In the context of C, a *byte* is always `CHAR_BIT` bits in width, and a *character* (not to be confused with a *multibyte character* or a *wide character*; these have different definitions) is *always* a single byte. See [C11/3.7.1p1](http://port70.net/~nsz/c/c11/n1570.html#3.7.1p1). – autistic Sep 11 '17 at 22:59

1 Answer


Is this idea right?

No. At the heart of a comment written by Siguza lies the summary of the issue:

1) read doesn't read lines, it just reads bytes. There's no reason buff should end with \n.

Additionally, there's no reason buff shouldn't contain multiple newline characters, and as there's no [posix] tag here, there's nothing to say what read does, let alone whether it's a syscall. Assuming you're referring to the POSIX function, there's no error handling. Where's your logic to handle the return values reserved for errors?


I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.

Providing you fix the issues mentioned above (more on that later), if you were to test this theory, you'd likely find, also at the heart of the comment by Siguza,

Disks usually work on a 512-byte basis and file system caches and even CPU/memory caches are a lot larger than that.

To an extent, you can expect your idea to approach O(log n), but your bottleneck will be one of those caches (likely the one closest to your keyboard/the filesystem/whatever is feeding the stream with information). At that point, you should stop guzzling memory which other programs might need, because your optimisation becomes less and less effective.

What do you think?

I think you should just STOP! You're guessing!

Once you've written your program, decide whether or not it's too slow. If it's not too slow, it doesn't need optimisation, and you probably won't shave enough nanoseconds to make optimisation worthwhile.

If it is too slow, then you should:

  • Use a profiler to determine what the most significant bottleneck is,
  • apply optimisations based on what your profiler tells you, then
  • use your profiler again, with the same inputs as before, to measure the effect your optimisation had.

If you don't use a profiler, your guess-work could result in slower code, or you might miss opportunities for more significant optimisations...


How do we read the second line?

Naturally, it makes sense to read character by character, rather than two hundred characters at a time, because there's no other way to stop reading the moment you reach a line terminating character.

Is there anyway to delimit the bytes?

Yes. The most sensible tools to use are provided by the C standard, and syscalls are managed automatically to be most efficient based on configurations decided by the standard library devs (who are likely much better at this than you are). Those tools are:

  • fgets to attempt to read a line (by reading one character at a time), up to a threshold (the size of your buffer). You get to decide how large a line should be, because it's more often the case that you won't expect a user/program to input huge lines.
  • strchr or strcspn to detect newlines from within your buffer, in order to determine whether you read a complete line.
  • scanf("%*[^\n]"); to discard the remainder of an incomplete line, when you detect those.
  • realloc to reallocate your buffer, if you decide you want to resize it and call fgets a second time to retrieve more data rather than discarding the remainder. Note: this will have an effect on the runtime complexity of your code, not that I think you should care about that...

Other options are available for the first three. You could use fgetc (or even read one character at a time) like I did at the end of this answer, for example...

In fact, that answer is highly relevant to your question, as it does make an attempt to exponentially increase the size. I wrote another example of this here.

It should be pointed out that the reason to address these problems is not so much optimisation, but the need to read a large, yet variably sized chunk of memory. Remember, if you haven't yet written the code, it's likely you won't know whether it's worthwhile optimising it!

Suffice to say, it isn't the read function you should try to reduce your dependence upon, but the malloc/realloc/calloc functions... That's the real kicker! If you don't absolutely need to store the entire line, then don't!

autistic