Replacing multiple new lines in a file with just one

Question

This function is supposed to search through a text file for the newline character. When it finds a newline, it increments the newLine counter, and when there are more than 2 consecutive blank lines, it's supposed to squeeze all the blank lines into just one blank line.

In my code, if there are 2 newlines it's supposed to get rid of them and squeeze them into one. For testing purposes I also have it printing "new line" when it reaches the newLine < 2 condition. Right now it prints "new line" for every newline, whether the line is blank or not, and it's not getting rid of the extra newlines. What am I doing wrong?

EDIT: HERE IS MY FULL CODE http://pastebin.com/bsD3b38a

So basically the program is supposed to concatenate two files together and then perform various operations on them, like what I'm trying to do, which is get rid of multiple consecutive blank lines. To execute it in Cygwin I run `./a -s file1 file2`. It's supposed to concatenate file1 and file2 into a file called contents.txt, then get rid of the consecutive newlines and display the result on my Cygwin terminal (stdout). The -s option calls the function that gets rid of the consecutive lines. The third and fourth arguments passed in (file1 and file2) are the two files to concatenate into contents.txt. The squeeze_lines function then reads contents.txt and is supposed to squeeze the newlines. See below for an example of the contents I put in file1.txt; file2.txt just has a bunch of words followed by empty lines.

int newLine = 1;
int c; 

if ((fileContents = fopen("fileContents.txt", "r")) == 0) 
{
    perror("fopen");
    return 1; 
}

while ((c = fgetc(fileContents)) != EOF)
{   
    if (c == '\n')
    {
        newLine++;
        if (newLine < 2) 
        {
            printf("new line");
            putchar(c); 
        }
    }
    else 
    {
        putchar(c); 
        newLine = 0;
    }
}

The program reads in a .txt file with these contents. It's supposed to read the file, get rid of the leading and consecutive newlines, and output the newly formatted contents to stdout on my Cygwin terminal.

/* hello world program */


#include <stdio.h>

    tab
            2tabs

I think your code logic is correct. 1) By defining `newLine = 1`, it will get rid of any leading '\n' in the input text. 2) When there are several consecutive newlines, it will only output one '\n'. — Eric Tsui, Jun 26 '15 at 06:14
@Sinstein: Yes, it is crucial that `c` is an `int` because `fgetc()`, `getc()` and `getchar()` all return an `int` and not a `char`. You can find a lot of questions covering the point. — Jonathan Leffler, Jun 26 '15 at 06:35
@Sinstein: one example of `int` vs `char` mattering is in [`while ((c = getc(file)) != EOF)` loop won't stop executing](http://stackoverflow.com/questions/13694394/). — Jonathan Leffler, Jun 26 '15 at 06:53

Jonathan Leffler · Answer 1 · 2015-06-29T00:08:54.253

Diagnosis

The logic looks correct if you have Unix line endings. If you have Windows CRLF line endings but are processing the file on Unix, you have a CR before each LF, and the CR resets newLine to zero, so you get the message for each newline.

This would explain what you're seeing.

It would also explain why everyone else is saying your logic is correct (it is — provided that the lines end with just LF and not CRLF) but you are seeing an unexpected result.
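
One way to check this diagnosis is to count the line-ending styles actually present in the file. The following is a minimal sketch, not part of the original program: `struct eol_counts` and `count_line_endings()` are hypothetical names, and the stream is assumed to be opened so that the C library does not translate line endings (for example with "rb" where that matters).

```c
#include <stdio.h>

/* Tallies of each line-ending style seen in a stream (hypothetical type). */
struct eol_counts {
    long lf;    /* bare "\n" (Unix) */
    long crlf;  /* "\r\n" pair (DOS/Windows) */
    long cr;    /* bare "\r" (old Mac) */
};

/* Hypothetical helper: read fp to EOF and classify every line ending. */
static struct eol_counts count_line_endings(FILE *fp)
{
    struct eol_counts n = {0, 0, 0};
    int c;

    while ((c = fgetc(fp)) != EOF)
    {
        if (c == '\n')
            n.lf++;
        else if (c == '\r')
        {
            int c1 = fgetc(fp);
            if (c1 == '\n')
                n.crlf++;
            else
            {
                n.cr++;
                if (c1 != EOF)
                    ungetc(c1, fp);   /* not part of this ending; re-read it */
            }
        }
    }
    return n;
}
```

If `crlf` comes back nonzero for the input file, there is a CR before each LF, and that CR is what keeps resetting `newLine` to zero.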

How to resolve it?

Fair question. One major option is to use dos2unix or an equivalent mechanism to convert the DOS file into a Unix file. There are many questions on the subject on SO.

If you don't need the CR ('\r' in C) characters at all, you can simply delete (not print, and not zero newLine) those.

If you need to preserve the CRLF line endings, you'll need to be a bit more careful. You'll have to record that you got a CR, then check that you get an LF, then print the pair, and then check whether you get any more CRLF sequences and suppress those, etc.
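
If running dos2unix is not possible, the first option can be approximated inside the program. This is a rough sketch only: `crlf_to_lf()` is a hypothetical helper name, and it assumes the input stream delivers the CR bytes untranslated. It turns each CRLF pair into a single LF and copies a lone CR unchanged.

```c
#include <stdio.h>

/* Hypothetical helper: copy in to out, turning every CRLF pair into one LF.
 * A CR that is not followed by LF is copied unchanged. */
static void crlf_to_lf(FILE *in, FILE *out)
{
    int c;

    while ((c = fgetc(in)) != EOF)
    {
        if (c == '\r')
        {
            int c1 = fgetc(in);
            if (c1 == '\n')
                c = '\n';            /* CRLF: keep only the LF */
            else if (c1 != EOF)
                ungetc(c1, in);      /* lone CR: re-read the next byte later */
        }
        fputc(c, out);
    }
}
```

The output of such a filter has Unix line endings, so the original LF-only squeezing logic can then be applied to it.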

Working code — `dupnl.c`

This program only reads from standard input; this is more flexible than only reading from a fixed file name. Learn to avoid writing code which only works with one file name; it will save you lots of recompilation over time. The code handles Unix-style files with newlines ("\n") only at the end; it also handles DOS files with CRLF ("\r\n") endings; and it also handles (old style) Mac (Mac OS 9 and earlier) files with CR ("\r") line endings. In fact, it handles arbitrary interleavings of the different line ending styles. If you want enforcement of a single mode, you have to do some work to decide which mode, and then use an appropriate subset of this code.

#include <stdio.h>

int main(void)
{
    FILE *fp = stdin;       // Instead of fopen()
    int newLine = 1;
    int c; 

    while ((c = fgetc(fp)) != EOF)
    {   
        if (c == '\n')
        {
            /* Unix NL line ending */
            if (newLine++ == 0)
                putchar(c); 
        }
        else if (c == '\r')
        {
            int c1 = fgetc(fp);
            if (c1 == '\n')
            {
                /* DOS CRLF line ending */
                if (newLine++ == 0)
                {
                    putchar(c);
                    putchar(c1);
                }
            }
            else
            {
                /* MAC CR line ending */
                if (newLine++ == 0)
                    putchar(c);
                if (c1 != EOF && c1 != '\r')
                    ungetc(c1, fp);
            }
        }
        else
        {
            putchar(c); 
            newLine = 0;
        }
    }

    return 0;
}

Example run — inputs and outputs

$ cat test.unx


data long enough to be seen 1 - Unix

data long enough to be seen 2 - Unix
data long enough to be seen 3 - Unix
data long enough to be seen 4 - Unix



data long enough to be seen 5 - Unix


$ sed 's/Unix/DOS/g' test.unx | ule -d > test.dos
$ cat test.dos


data long enough to be seen 1 - DOS

data long enough to be seen 2 - DOS
data long enough to be seen 3 - DOS
data long enough to be seen 4 - DOS



data long enough to be seen 5 - DOS


$ sed 's/Unix/Mac/g' test.unx | ule -m > test.mac
$ cat test.mac
$ ta long enough to be seen 5 - Mac
$ odx test.mac
0x0000: 0D 0D 64 61 74 61 20 6C 6F 6E 67 20 65 6E 6F 75   ..data long enou
0x0010: 67 68 20 74 6F 20 62 65 20 73 65 65 6E 20 31 20   gh to be seen 1 
0x0020: 2D 20 4D 61 63 0D 0D 64 61 74 61 20 6C 6F 6E 67   - Mac..data long
0x0030: 20 65 6E 6F 75 67 68 20 74 6F 20 62 65 20 73 65    enough to be se
0x0040: 65 6E 20 32 20 2D 20 4D 61 63 0D 64 61 74 61 20   en 2 - Mac.data 
0x0050: 6C 6F 6E 67 20 65 6E 6F 75 67 68 20 74 6F 20 62   long enough to b
0x0060: 65 20 73 65 65 6E 20 33 20 2D 20 4D 61 63 0D 64   e seen 3 - Mac.d
0x0070: 61 74 61 20 6C 6F 6E 67 20 65 6E 6F 75 67 68 20   ata long enough 
0x0080: 74 6F 20 62 65 20 73 65 65 6E 20 34 20 2D 20 4D   to be seen 4 - M
0x0090: 61 63 0D 0D 0D 0D 64 61 74 61 20 6C 6F 6E 67 20   ac....data long 
0x00A0: 65 6E 6F 75 67 68 20 74 6F 20 62 65 20 73 65 65   enough to be see
0x00B0: 6E 20 35 20 2D 20 4D 61 63 0D 0D 0D               n 5 - Mac...
0x00BC:
$ dupnl < test.unx
data long enough to be seen 1 - Unix
data long enough to be seen 2 - Unix
data long enough to be seen 3 - Unix
data long enough to be seen 4 - Unix
data long enough to be seen 5 - Unix
$ dupnl < test.dos
data long enough to be seen 1 - DOS
data long enough to be seen 2 - DOS
data long enough to be seen 3 - DOS
data long enough to be seen 4 - DOS
data long enough to be seen 5 - DOS
$ dupnl < test.mac
$ ta long enough to be seen 5 - Mac
$ dupnl < test.mac | odx
0x0000: 64 61 74 61 20 6C 6F 6E 67 20 65 6E 6F 75 67 68   data long enough
0x0010: 20 74 6F 20 62 65 20 73 65 65 6E 20 31 20 2D 20    to be seen 1 - 
0x0020: 4D 61 63 0D 64 61 74 61 20 6C 6F 6E 67 20 65 6E   Mac.data long en
0x0030: 6F 75 67 68 20 74 6F 20 62 65 20 73 65 65 6E 20   ough to be seen 
0x0040: 32 20 2D 20 4D 61 63 0D 64 61 74 61 20 6C 6F 6E   2 - Mac.data lon
0x0050: 67 20 65 6E 6F 75 67 68 20 74 6F 20 62 65 20 73   g enough to be s
0x0060: 65 65 6E 20 33 20 2D 20 4D 61 63 0D 64 61 74 61   een 3 - Mac.data
0x0070: 20 6C 6F 6E 67 20 65 6E 6F 75 67 68 20 74 6F 20    long enough to 
0x0080: 62 65 20 73 65 65 6E 20 34 20 2D 20 4D 61 63 0D   be seen 4 - Mac.
0x0090: 64 61 74 61 20 6C 6F 6E 67 20 65 6E 6F 75 67 68   data long enough
0x00A0: 20 74 6F 20 62 65 20 73 65 65 6E 20 35 20 2D 20    to be seen 5 - 
0x00B0: 4D 61 63 0D                                       Mac.
0x00B4:
$

The lines starting $ ta are where the prompt overwrites the previous output (and the 'long enough to be seen' part is because my prompt is normally longer than just $).

odx is a hex dump program. ule is for 'uniform line endings' and analyzes or transforms data so it has uniform line endings.

Usage: ule [-cdhmnsuzV] [file ...]
  -c  Check line endings (default)
  -d  Convert to DOS (CRLF) line endings
  -h  Print this help and exit
  -m  Convert to MAC (CR) line endings
  -n  Ensure line ending at end of file
  -s  Write output to standard output (default)
  -u  Convert to Unix (LF) line endings
  -z  Check for zero (null) bytes
  -V  Print version information and exit

And could you give some suggestions for a portable version of the statement `if (c == '\n')`, given the different newline definitions: DOS & Windows: \r\n (0D 0A); Unix & Mac OS X: \n (0A); Macintosh (OS 9): \r (0D)? Thanks. — Eric Tsui, Jun 26 '15 at 07:46
That could become a major essay, @EricTsui. There are multiple issues, ranging from 'text mode files with native line endings are mapped to newline endings when read and newline endings are mapped to native when written' (which affects functions such as `fgets()` too, of course), to mechanics of reading lines from files with indeterminate endings (you can't use `fgets()` or even POSIX `getline()` because they only know about native line endings), etc. I have a program `ule` (uniform line endings) which can analyze line endings and convert to DOS, Mac or Unix (and a few other tricks too). — Jonathan Leffler, Jun 26 '15 at 14:46
@Jonathan Hi thanks for the reply. I'm not sure if I'm suppose to be checking for '\r' as well. The program is reading a ".txt" file and searching through it for consecutive new lines. I am using a cygwin terminal running on a windows 8.1 machine. Although I don't think that should matter. I've added my full code to the original post with an explanation of how it works if you would be kind enough to look at it. Thank you. — nb023, Jun 26 '15 at 18:27
Oh, yes it matters! It matters a lot, in fact. Since you are using Cygwin on Windows, you'll mostly have Windows (DOS) style CRLF line endings. The chances are high that the Cygwin code doesn't map CRLF to newline during input, so my analysis probably applies. The simplest way to check is to add `if (c == '\r') puts("\nCR");` before the `if (c == '\n')` test. Then you'll see the CRs appearing before the 'new line' messages. I included the `\n` before the CR and an implicit one after because `putchar(c)` when `c == '\r'` will overwrite what's at the beginning of the line. — Jonathan Leffler, Jun 26 '15 at 19:38
@Jonathan Okay so from what I understand my code should look like this http://pastebin.com/FbuMLbXR all this does is add CR to every 2nd line and it adds a bunch of new lines that weren't in the original file — nb023, Jun 26 '15 at 22:12
@nb023 I think the better way is to keep your code unchanged and just deal with the format differences between files. When you transfer a file from Windows to Cygwin, you'd use a command such as `dos2unix` to turn '\r\n' (Windows style) into '\n' line endings (as under *nix). — Eric Tsui, Jun 26 '15 at 22:40
@eric unfortunately this is for an assignment, and when they test it for marking they will not use the dos2unix command. Is there another way to do it like opening the file in a different mode like binary mode? — nb023, Jun 26 '15 at 23:13
Note that my addition of the CR message was to demonstrate that there were carriage return (CR or `'\r'`) characters in your text files. I'll look at the rest later. But you should be able to modify your code so that it eliminates pairs of CRLF (`'\r' '\n'`) characters, or pairs of NL (newline, `'\n'`) characters (and optionally pairs of CR (`'\r'`) characters too - for old Mac OS 9 files). — Jonathan Leffler, Jun 26 '15 at 23:47
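
The comments above point out that `fgets()` and `getline()` only know about native line endings. As an illustration of what reading lines with indeterminate endings could look like, here is a small sketch. `read_line_any_eol()` is a hypothetical function, not something from the thread or from `ule`; it assumes the stream is not doing any line-ending translation and it silently discards characters that do not fit in the buffer.

```c
#include <stdio.h>

/* Hypothetical helper: read one line that may end in LF, CRLF, or CR.
 * Stores at most size-1 characters in buf, without the line ending,
 * and null-terminates it. Returns 1 if a line (possibly empty) was read,
 * 0 at end of input. Characters beyond the buffer size are discarded. */
static int read_line_any_eol(FILE *fp, char *buf, size_t size)
{
    size_t len = 0;
    int got_any = 0;
    int c;

    while ((c = fgetc(fp)) != EOF)
    {
        got_any = 1;
        if (c == '\n')
            break;                      /* Unix LF ending */
        if (c == '\r')
        {
            int c1 = fgetc(fp);
            if (c1 != '\n' && c1 != EOF)
                ungetc(c1, fp);         /* bare CR: next byte starts a new line */
            break;                      /* CRLF or CR ending */
        }
        if (len + 1 < size)
            buf[len++] = (char)c;
    }
    if (size > 0)
        buf[len] = '\0';
    return got_any;
}
```

With a reader like this, a blank line is simply a line of length zero, so consecutive blank lines can be detected without caring which ending the file uses.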

Eric Tsui · Answer 2 · 2015-06-28T01:27:34.017

2

What the sample code does is:

1) Squeeze several consecutive '\n' into just one '\n'.

2) Get rid of any leading '\n' at the beginning.

  input:   '\n\n\naa\nbb\n\ncc' 
  output:   aa'\n'    
            bb'\n' //notice, there is no blank line here
            cc

If that was the aim, then your code logic is correct for it.

By defining newLine = 1, it will get rid of any leading '\n' in the input text.
And when a '\n' remains after processing, it prints "new line" as a hint.

Back to the question itself: if the actual aim is to squeeze consecutive blank lines into just one blank line (which needs two consecutive '\n', one to terminate the previous line and one for the blank line), then:

1) Let's first confirm the input and the expected output.

Input text:

aaa'\n' //1st line, there is a '\n' appended to 'aaa'  
'\n'    //2nd line, blank line
bbb'\n' //3rd line, there is a '\n' appended to 'bbb'
'\n'    //4th line, blank line
'\n'    //5th line, blank line
'\n'    //6th line, blank line
ccc     //7th line,

Expected Output text:

aaa'\n' //1st line, there is a '\n' appended to 'aaa'  
'\n'    //2nd line, blank line
bbb'\n' //3rd line, there is a '\n' appended to 'bbb'
'\n'    //4th line, blank line
ccc     //5th line,

2) If that is exactly the program's target, then:

if (c == '\n')
{
    newLine++;
    if (newLine < 3) // here should be 3 to print '\n' twice,
                     // one for 'aaa\n', one for blank line 
    {
        //printf("new line");
        putchar(c); 
    }
}

3) If you have to process a Windows-format file (with \r\n endings) under Cygwin, then you could do as follows:

while ((c = fgetc(fileContents)) != EOF)
{   
    if ( c == '\r') continue;// add this line to discard possible '\r'
    if (c == '\n')
    {
        newLine++;
        if (newLine < 3) //here should be 3 to print '\n' twice
        {
            printf("new line");
            putchar(c); 
        }
    }
    else 
    {
        putchar(c); 
        newLine = 0;
    }
}
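
Steps 2) and 3) can be combined and wrapped in a function so that they are easy to test. This is only a sketch: the name `squeeze_blank_lines` and its `FILE *` parameters are assumptions, not part of the original program, and the debugging `printf()` is left out. Because it discards every '\r', it would also join the lines of an old Mac (CR-only) file.

```c
#include <stdio.h>

/* Hypothetical helper wrapping the loop above in a function.
 * Copies in to out, discarding every '\r' and allowing at most two
 * consecutive '\n' (one to end a line, one for a single blank line). */
static void squeeze_blank_lines(FILE *in, FILE *out)
{
    int newLine = 1;   /* start as if a line ending had just been seen */
    int c;

    while ((c = fgetc(in)) != EOF)
    {
        if (c == '\r')
            continue;              /* drop the CR of a CRLF pair */
        if (c == '\n')
        {
            if (++newLine < 3)     /* 1st ends the line, 2nd is the blank line */
                fputc(c, out);
        }
        else
        {
            fputc(c, out);
            newLine = 0;
        }
    }
}
```

Calling `squeeze_blank_lines(fileContents, stdout)` after the `fopen()` check would take the place of the loop in the question. Starting the counter at 1 means that several leading blank lines collapse to a single one.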

edited Jun 28 '15 at 01:27

answered Jun 26 '15 at 06:23


Hi yes my aim is to get rid of any leading multiple new lines, and squeeze the consecutive ones into just one. I can't get it to work though. – nb023 Jun 26 '15 at 18:00
I tried your code, and it works just fine on my system (OS X). It outputs the expected result (squeezing the '\n') to stdout, while fileContents.txt keeps its redundant '\n'. So please refer to Jonathan Leffler's suggestion and consider the format difference between Cygwin and Windows. – Eric Tsui Jun 26 '15 at 22:28
did you run my full code or just the snippet? I tried his suggestion by adding `if (c=='\r'){ puts('CR\N') }` and then afterwards `if(c=='\n') { newLine++; if (newLine > 2) {putchar(c);}` but all that does is add CR to every 2nd new line – nb023 Jun 26 '15 at 23:16
I changed it to `if (c == '\r' || c == '\n')` and now it just squeezes EVERYTHING onto one line. Here is my code http://pastebin.com/6fxfbM53 – nb023 Jun 27 '15 at 00:47
@nb023 Besides that, did you change to `if(newLine <2){ putchar('\n') }` as well? I updated my comment as above. – Eric Tsui Jun 27 '15 at 00:49
I just added it, here's how my code looks now http://pastebin.com/6fxfbM53 but now the problem is, it's deleting all the blank lines. What I need it to do is "squeeze" consecutive (more than 2) blank lines into just one. So if you have 2, 3, 4, 5, etc. blank lines, it'll squeeze them into just one blank line. The reason why I initially have `newLine = 1;` is so that if the file begins with multiple consecutive blank lines it'll squeeze them into one. – nb023 Jun 27 '15 at 00:58
@nb023 I checked your code; please delete the `putchar(c)` above `putchar('\n')`, it should be `if(newLine <2){ putchar('\n') }`. – Eric Tsui Jun 27 '15 at 01:00
Okay did that, but it's still the same problem. It gets rid of all blank lines – nb023 Jun 27 '15 at 01:09
@nb023 I updated my answer to provide a more detailed description and the code. Hope it gives some help. – Eric Tsui Jun 27 '15 at 05:05

olivecoder · Answer 3 · 2015-06-26T07:10:55.523

1

[EDITED] The minimal change is:

if ( newLine <= 2)

Forgive me and forget the previous code.

A slightly simpler alternative:

int c;
int duplicates=0;
while ((c = fgetc(fileContents)) != EOF)
{
    if (c == '\n') {
        if (duplicates > 1) continue;
        duplicates++;
    }
    else {
        duplicates=0;
    }
    putchar(c);
}

edited Jun 26 '15 at 07:10

answered Jun 26 '15 at 06:29


Hi thanks I just tried both methods you suggested and neither works. I've added my full code to the original post in a pastebin link if you would like to take a look. – nb023 Jun 26 '15 at 18:20

WedaPashi · Answer 4 · 2015-06-26T06:29:39.267

Dry run of the code: if the file starts with a newline character and newLine is 1:

For the first iteration:

if (c == '\n') //Will be evaluated as true for a new-line character. 
{
    newLine++; //newLine becomes 2 before next if condition is evaluated.
    if (newLine < 2) //False, since newLine is not less than 2, but equal.
    {
        printf("new line");
        putchar(c); 
    }
}
else //Not entered
{
    putchar(c); 
    newLine = 0;
}

On the second iteration: (Assume that it is a consecutive newline char case)

if (c == '\n') //Will be evaluated as true for a new-line character.
{
    newLine++; //newLine becomes 3 before next if condition is evaluated.
    if (newLine < 2) //False, since newLine is greater than 2.
    {
        printf("new line");
        putchar(c); 
    }
}
else //Not entered
{
    putchar(c); 
    newLine = 0;
}

So,

Initialize newLine to 0.

Hi thanks I just tried changing it to 0 and it still doesn't work. I've added my full code to the original post in a pastebin link if you would like to take a look. — nb023, Jun 26 '15 at 18:21

score 0 · Answer 5 · answered Jun 26 '15 at 18:14

if newline > 2

That should be greater than or equal to if you want to get rid of the second line. Also, you have newLine starting at one, then being incremented to two, then reset to zero. Instead I recommend replacing the count with a boolean like

bool firstNewlineFound = false;   /* requires <stdbool.h> */

Then whenever you find a newline, set it to true; whenever it is true, delete one newline and set it back to false.
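
One possible reading of this suggestion, sketched in C. The function name `collapse_newlines` and the flag name `lastWasNewline` are hypothetical. The sketch departs from the literal wording in one respect: the flag is reset only when a non-newline character arrives, since resetting it after each deleted newline would let every other newline through. It handles '\n' only (no '\r'), and it leaves no blank lines at all, except that a file starting with newlines keeps exactly one.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: copy in to out, collapsing each run of '\n'
 * into a single '\n' by remembering whether the previous character
 * written was a newline. */
static void collapse_newlines(FILE *in, FILE *out)
{
    bool lastWasNewline = false;
    int c;

    while ((c = fgetc(in)) != EOF)
    {
        if (c == '\n')
        {
            if (lastWasNewline)
                continue;          /* repeated newline: drop it */
            lastWasNewline = true;
        }
        else
        {
            lastWasNewline = false;
        }
        fputc(c, out);
    }
}
```

To keep one blank line between paragraphs, as the question asks, a single flag is not enough, because the code has to tell the first, second, and later newlines of a run apart; that is why the other answers keep a counter.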
