How can I tell the end of a line in C?

Question

I don't know whether the lines end with '\n', '\r', or '\r\n', and I don't know how the text is encoded. Besides, if the encoding is UTF-8, there may be no BOM. Is there a function or a library that can detect this, or can someone just tell me how to find the terminator of a line?

See that question : http://stackoverflow.com/questions/1279779/what-is-the-difference-between-r-and-n — SolarBear, Mar 28 '13 at 13:06
If you don't know the encoding then it can't be done with certainty. Consider the sequence of bytes `30 0A`. Unless you know the encoding, there is no way to tell whether that is the ASCII representation of the numeral "0" followed by a linebreak, or the UTF16-BE representation of the character "《". So, first you need a library to guess character encoding, then you can think about linebreaks. — Steve Jessop, Mar 28 '13 at 13:23
Do you mean any text encoding, or is it always ASCII/UTF-8 but with differing line terminators? — teppic, Mar 28 '13 at 14:04
@SteveJessop If you assume that if it is the GBK encoding then terminator is '\r\n',if utf8, encoding terminator is '\n',then is there better way to do? — choury, Mar 28 '13 at 14:50

score 1 · Answer 1 · answered Mar 28 '13 at 13:07

Use wcslen to get the size in bytes of a UTF-8 string.

http://linux.die.net/man/3/wcslen

answered Mar 28 '13 at 13:07

Gull_Code

What does this have to do with line termination? – autistic Mar 28 '13 at 13:10
Unless he uses some kind of memory mapping in his source, he'll likely have the line inside a char array. He also said it can be UTF-8. Having the size in bytes of the UTF-8 string also gives you the real size of the string: start + size = the end of the line. – Gull_Code Mar 28 '13 at 13:15
But the source is just plain text (not just English). Because it was created on *nix or Windows (not by me), the format is not specified. – choury Mar 28 '13 at 13:23
Then maybe have a look at Enca: "Enca is an Extremely Naive Charset Analyser. It detects character set and encoding of text files and can also convert them to other encodings using either a built-in converter or external libraries and tools like libiconv, librecode, or cstocs." http://cihar.com/software/enca/ Then adapt the way your program works to the encoding. – Gull_Code Mar 28 '13 at 13:30
OK,I'll go to have a look at its source files to see if I can find something – choury Mar 28 '13 at 13:33

autistic · Accepted Answer · 2017-09-29T06:13:54.487

Are you by chance using fgets, fread, fputs, fwrite, etc, on a file that is open for reading text? If so, the implementation will automatically transform OS-specific line terminators (eg. "\r\n") into '\n' when reading, and transform '\n' into OS-specific line terminators when writing.
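
As a minimal sketch of what this looks like on the reading side: after fgets has read from a text-mode stream, the line ends in '\n' (when it fit in the buffer), and a stray '\r' can still be present if the file came from another OS. The helper name `chomp` is hypothetical and not part of the standard library; it relies only on the standard strcspn. It handles "\n" and "\r\n", not files that use a bare '\r' as the terminator.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: cut a line (for example one filled in by fgets)
 * at its first '\r' or '\n' and return the length of what remains. */
size_t chomp(char *line)
{
    assert(line != NULL);
    /* strcspn returns the index of the first CR or LF,
     * or strlen(line) if neither is present. */
    size_t len = strcspn(line, "\r\n");
    line[len] = '\0';
    return len;
}
```

A typical call would follow a successful `fgets(buf, sizeof buf, fp)` with `chomp(buf)`, so the rest of the program sees the line without any terminator.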

There are two other scenarios, one of which turned out to be OP's:

1. OP was struggling with "\r\n" carried over from files produced on another OS, which his (presumably Unix-like) OS does not convert when the file is opened for reading. My suggestion is to use dos2unix for these one-off conversions, rather than bloating your code with something which will likely never run again.
2. You're not using one of those functions. This could be because you're using a stream such as a socket, and perhaps the protocol requires "\r\n". In this case, you should use strstr to find the exact sequence "\r\n".
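
The strstr suggestion could be sketched as follows. The function name `first_line_length` is made up for illustration, and the sketch assumes the received data has already been copied into a NUL-terminated buffer with no embedded NUL bytes, since strstr works on C strings only.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: return the length of the first line in a
 * NUL-terminated buffer, or -1 if no complete "\r\n" is present yet. */
long first_line_length(const char *buf)
{
    assert(buf != NULL);
    const char *end = strstr(buf, "\r\n"); /* exact two-byte sequence */
    return end ? (long)(end - buf) : -1;
}
```

Returning -1 rather than the buffer length lets the caller tell "line not finished, read more data" apart from an empty line, which has length 0.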

UTF-8 was designed with a degree of compatibility with ASCII in mind, hence you can assume that any system that uses UTF-8 will also use ASCII or some similar character set. Every byte of a multi-byte sequence has a value of 0x80 or greater. Since '\n' lies within the 0x00-0x7F range, you're guaranteed that it'll be a single byte and that it won't appear as part of a multi-byte character. The same holds for '\r'.
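
Because of that guarantee, a plain byte-by-byte scan is enough to find line terminators in ASCII or UTF-8 text. The following is only a sketch under that encoding assumption (it would not work for UTF-16), and `find_line_end` is a hypothetical name. It accepts "\n", "\r\n" and a bare "\r".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scan n bytes of ASCII or UTF-8 text for the first
 * line terminator. Returns the number of bytes before it and stores the
 * terminator's length in *term_len: 1 for "\n" or "\r", 2 for "\r\n",
 * 0 if the buffer holds no terminator (the return value is then n). */
size_t find_line_end(const char *s, size_t n, size_t *term_len)
{
    assert(s != NULL && term_len != NULL);
    for (size_t i = 0; i < n; i++) {
        if (s[i] == '\n') {
            *term_len = 1;
            return i;
        }
        if (s[i] == '\r') {
            /* A CR directly followed by LF counts as one terminator. */
            *term_len = (i + 1 < n && s[i + 1] == '\n') ? 2 : 1;
            return i;
        }
    }
    *term_len = 0;
    return n;
}
```

One caveat when the input arrives in chunks: a '\r' in the last byte of a chunk may be the first half of "\r\n", so the caller may want to read more data before treating it as a bare CR.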

edited Sep 29 '17 at 06:13

answered Mar 28 '13 at 13:22

autistic

Unfortunately, reading in text mode won't turn non-OS-specific `\r\n` into `\n`. So if you need to cope with the possibility of someone copying a text file from Windows to Linux, you need another solution. – Steve Jessop Mar 28 '13 at 13:26
@SteveJessop While I agree that it's annoying when someone mixes up encodings, there are utilities in existence that perform this transformation for you. Why reinvent the wheel? If you spend fifteen minutes accounting for each OS-specific line ending, then you'll end up with a very complex solution to a simple problem. – autistic Mar 28 '13 at 13:31
@choury Is it common for your program to deal with different OSes, or is this a problem you can rarely see your program dealing with (eg. does your program mostly deal with text files produced on the same OS)? Why introduce bloat when you could use other programs to perform the conversion for you (eg. `dos2unix`, `unix2dos`), outside of your program? – autistic Mar 28 '13 at 13:34
@modifiablelvalue Yes, most of the files are from the same OS, but the rest have to be dealt with too. I don't know which OS a file comes from. – choury Mar 28 '13 at 13:45
@choury Ahh, so you're planning on interpreting LF as a line terminator for Unix, CR+LF as a line terminator for MS-DOS/Windows (pre-Windows 7), LF+CR as a line terminator for RISC OS, CR as a line terminator for MacOS (pre-MacOSX), RS for QNX, NL for z/OS and many more as newlines? That's a lot of bloat... Are you going to interpret `'N'` as a newline, too? – autistic Mar 28 '13 at 14:09
@modifiablelvalue Not that many OSes; just Unix, MS-DOS/Windows and Mac need to be considered. – choury Mar 28 '13 at 14:19
@choury No. What *should be considered* is the feasibility of writing this code when it's already been written. No real-world employer would ask you to do this. The cost of the time that it takes to define the problem specifically, produce the code and debug the code won't be justified for another ten years, considering you can just convert these files using the conversion app... If you want to solve this problem, you need a specific description of the problem from the perspective of a system analyst familiar with the internals of fgets who can answer any questions we're likely to raise. – autistic Mar 28 '13 at 14:43
@choury So far, you have ignored this question: Why introduce bloat when you could use other programs to perform the conversion for you (eg. `dos2unix`, `unix2dos`), outside of your program? – autistic Mar 28 '13 at 14:44
@modifiablelvalue Thanks, I thought it could only convert UTF-8. – choury Mar 28 '13 at 15:11
@modifiablelvalue As you can see in my profile, this is my first question on Stack Overflow. It is related to my homework for "Compilation Principle". English is not my mother tongue, so I find it hard to express my meaning accurately. – choury Mar 28 '13 at 15:23
