0

The following program is intended to make a copy of an .exe application file. But one small detail determines whether it gives me a proper copy of the intended file, RealPlayer.exe, or a corrupted file.

What I do is read from the source file in binary mode and write to the new copy in the same mode. For this I use a variable ch. If ch is of type char, I get a corrupted file that is only a few bytes in size, while the original file is 26MB. If I change the type of ch to int, the program works fine and gives me an exact copy of RealPlayer.exe, sized 26MB. So let me ask two questions that arise from this premise. I would appreciate it if you could answer both parts:

1) Why does using type char for ch mess things up while type int works? What is wrong with type char? After all, shouldn't it read byte by byte from the original file (as a char is one byte itself) and write byte by byte to the new copy? After all, isn't that what the int type does, i.e., read 4 bytes from the original file and then write them to the copy file? Why the difference between the two?

2) Why is the file so small compared to the original if we use type char for ch? Let's forget for a moment that the copied file is corrupt to begin with and focus on the size. Why is the size so small if we copy character by character (or byte by byte), but big (the original size) when we copy "integer by integer" (or 4 bytes by 4 bytes)?

A friend suggested that I simply stop asking questions and use int, because it works while char doesn't! But I need to understand what's going on here, as I see a serious gap in my understanding of this matter. Detailed answers would be much appreciated. Thanks.

#include <stdio.h>
#include <stdlib.h>

int main()
{
    char ch;   //This is the cause of problem
    //int ch;   //This solves the problem
    FILE *fp, *tp;

    fp = fopen("D:\\RealPlayer.exe", "rb");
    tp = fopen("D:\\copy.exe", "wb");
    if (fp == NULL || tp == NULL)
    {
        printf("Error opening files");
        exit(-1);
    }

    while ((ch = getc(fp)) != EOF)
        putc(ch, tp);

    fclose(fp);
    fclose(tp);
}
Charles
Rüppell's Vulture
  • @JimBalter Thanks. That was a helpful link. – Rüppell's Vulture May 10 '13 at 07:54
  • See also http://msdn.microsoft.com/en-us/library/yeby3zcb%28vs.71%29.aspx relating to some of your comments below. On Windows systems, in text mode only, ctrl-Z (ASCII 26, 0x1A) will be interpreted as end-of-file. What that means is that getc will return EOF (-1), *not* ctrl-Z, if a ctrl-Z is encountered. Which means that opening binary files in text mode on Windows systems is a very bad thing to do. – Jim Balter May 10 '13 at 08:14
  • @JimBalter Thanks again. If any of my comments went against SO rules & FAQ, then I apologize to you. – Rüppell's Vulture May 10 '13 at 08:16

2 Answers

4

The problem is in the termination condition for the loop: specifically, the type of the variable ch combined with the rules for implicit type conversions.

while((ch=getc(fp))!=EOF)

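getc() returns an int: either the byte that was read, as an unsigned char converted to int (0 to 255 on systems with 8-bit bytes), or EOF, a negative value that is usually -1.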

You stuff the result into a char, which is then promoted back to int for the comparison. Unpleasant things happen, such as sign extension.

Let's assume your compiler treats "char" as "signed char" (the standard gives it a choice).

You read a byte with the bit pattern 0xff (255) from your binary file. Stored in a signed char, that is -1. It gets promoted to int, giving 0xffffffff (assuming a 32-bit int), and is compared with EOF (also -1, i.e. 0xffffffff). They match, so your program thinks it has found the end of the file and obediently stops copying. Oops! That is also why the copy is so small: it ends at the first 0xff byte in the original.
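The comparison described above can be sketched in isolation. This is only an illustration: the two helper names are hypothetical, `signed char` is used explicitly to model a compiler that treats plain char as signed, and it assumes the usual case where EOF is -1 and converting 255 to signed char yields -1.

```c
#include <stdio.h>

/* Hypothetical helper: mimics `ch = getc(fp)` followed by `ch != EOF`
   when ch is a signed char. Returns 1 if the value is mistaken for EOF. */
int looks_like_eof_signed(int value_from_getc)
{
    /* Assumption: 255 converts to -1 here, as on typical two's-complement systems. */
    signed char ch = (signed char)value_from_getc;
    return ch == EOF;   /* ch is promoted to int (sign-extended) before the comparison */
}

/* Hypothetical helper: the same check when the value is kept in an int. */
int looks_like_eof_int(int value_from_getc)
{
    int ch = value_from_getc;
    return ch == EOF;   /* 255 and EOF remain distinct values */
}
```

Under those assumptions, `looks_like_eof_signed(0xFF)` reports a match with EOF, while `looks_like_eof_int(0xFF)` does not.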

One other note - you wrote:

After all, isn't that what the int type does, i.e., read 4 bytes from the original file and then write them to the copy file?

That's incorrect. getc(fp) behaves the same regardless of what you do with the returned value: it reads exactly one byte from the file, if one is available, and returns that value as an int.
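As a rough sketch of the corrected loop, the copy can be written as a small function that keeps the result of getc() in an int. The name `copy_stream` and its return convention are made up for this illustration and are not part of the original program.

```c
#include <stdio.h>

/* Hypothetical helper: copies src to dst one byte at a time.
   Returns the number of bytes copied, or -1 on a read or write error. */
long copy_stream(FILE *src, FILE *dst)
{
    int ch;            /* int, so EOF stays distinct from the byte 0xFF */
    long count = 0;

    while ((ch = getc(src)) != EOF) {
        if (putc(ch, dst) == EOF)
            return -1; /* write failed */
        count++;       /* exactly one byte per getc() call */
    }
    return ferror(src) ? -1 : count;
}
```

With both files opened in binary mode ("rb" and "wb"), as in the question, calling `copy_stream(fp, tp)` should copy every byte, including any 0xFF bytes, and the count goes up by one per call to getc(), not by four.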

Arlie Stephens
  • Why then does it work fine when copying text files in text mode? Your argument about `getc()` returning `int` holds there too. – Rüppell's Vulture May 10 '13 at 06:03
  • 1
    @Rüppell'sVulture: It is rare to encounter the byte `0xff` in a text file. It sometimes happens, but it is rare. – Dietrich Epp May 10 '13 at 06:04
  • @ArlieStephens Isn't EOF ASCII 26? – Rüppell's Vulture May 10 '13 at 06:08
  • @DietrichEpp Isn't ASCII of EOF supposed to be 26 instead of -1? – Rüppell's Vulture May 10 '13 at 06:09
  • If it's a US ASCII text file, 0xff is impossible, not just rare, AFAIK. Characters with the high bit set are generally used for special and non-English characters. In the old days, there were terminal devices which couldn't handle the high bit being set, so it wasn't used. ASCII has been extended since, but nothing got moved to new values. – Arlie Stephens May 10 '13 at 06:09
  • @ArlieStephens Isn't EOF ASCII 26? Or -1? – Rüppell's Vulture May 10 '13 at 06:10
  • I don't have the standard that specifies fgetc() behavior handy, so I simply used grep on a handy linux system; I found `#define EOF (-1)` in /usr/include/stdio.h. The program quoted above includes stdio.h, so that's the value it's presumably using, unless the standard let the value be implementation dependent. – Arlie Stephens May 10 '13 at 06:12
  • @ArlieStephens Please clarify what you wrote. How come EOF is -1? Isn't it supposed to be ASCII 26 or CTRL-Z? – Rüppell's Vulture May 10 '13 at 06:17
  • 1
    Excuse me guys, allow me: as per the standard, EOF is an implementation-defined negative integer. From 7.19 Input/output, 7.19.1 Introduction: "EOF, which expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream." – Dayal rai May 10 '13 at 06:21
  • @Rüppell'sVulture: EOF is not an actual character, so it is not part of ASCII. EOF is just a constant which `getc()` returns when it encounters the end of file. – Dietrich Epp May 10 '13 at 06:23
  • @Rüppell's Vulture - The value that matters here is the value your program sees. That's going to be what it gets from the header files. On my system, that's -1, and it's been -1 on every *nix system I can remember. It MAY be that it's 26 in some other context, e.g. in the ascii specification - but when I did "man ascii" as a convenient way to look at the values, I found decimal 26 names SUB (substitute). The closest thing to an EOF I found was EOT (end of transmission), with the value 4. – Arlie Stephens May 10 '13 at 06:24
  • 1
    Slight correction, `getc` returns either an `unsigned char` converted to `int` or `EOF`; and `EOF` is usually `-1`, but as Dayal rai just wrote, it could be a different negative value. – Daniel Fischer May 10 '13 at 06:24
  • @DanielFischer How would you contrast between EOF and the substitute character 26 on the ASCII table? – Rüppell's Vulture May 10 '13 at 06:26
  • @Dayal rai - Thanks. I thought it might be something like that, but while I keep an (outdated) copy of the C standard around, I don't know my way around other relevant standards. I presume this is from posix? – Arlie Stephens May 10 '13 at 06:27
  • @DanielFischer With your and Dietrich's comments above, I am close to what I intended to know. I can see how 255 is equivalent to -1 and hence the loop exits. It's just ASCII 26 that still confuses me. What is it for then? – Rüppell's Vulture May 10 '13 at 06:27
  • @ArlieStephens `printf("%d",EOF)` gives **-1** on my computer. – Rüppell's Vulture May 10 '13 at 06:28
  • 1
    @Rüppell'sVulture `EOF` is a macro expanding to a negative `int` value per the C standard, it has nothing to do with any ASCII character. – Daniel Fischer May 10 '13 at 06:28
  • @DanielFischer Again a major confusion. Look at the accepted answer for this link. It seems to suggest CTRL-Z is equal to 26, and is the EOF. http://stackoverflow.com/questions/229924/difference-between-files-writen-in-binary-and-text-mode – Rüppell's Vulture May 10 '13 at 06:34
  • @Rüppell'sVulture That's Windows, not C. Outside my expertise. What I know is that on Windows, typing Ctrl-Z at the terminal signals end-of-input like Ctrl-D does on *nix. How that is encoded, no idea. – Daniel Fischer May 10 '13 at 06:41
  • @DanielFischer I captured the value of Ctrl-Z from the keyboard and it is indeed -1. Based on this, and on your and Dietrich's answers, don't you think the top answer to the following question is incorrect, as it states `If the file is opened in append mode, the end of the file will be examined for a ctrl-z character (character 26) and that character removed`? http://stackoverflow.com/questions/229924/difference-between-files-writen-in-binary-and-text-mode – Rüppell's Vulture May 10 '13 at 08:02
  • @Rüppell'sVulture It's more complicated than that. When you type Ctrl-Z, the OS interprets that before your programme gets to see it. What the OS sends to the programme is yet another question (which I don't know the answer to). Then the runtime environment makes the programme return -1 for the next `getc` operation. Yet another thing is files. Some OSs did mark the end of file with a special byte (methinks it was 0x1a indeed), Windows kept that at least partially. How it actually is nowadays, you'd have to ask somebody who knows Windows. – Daniel Fischer May 10 '13 at 08:22
  • @DanielFischer Thanks. That would be enough for a medium-level learner like me. – Rüppell's Vulture May 10 '13 at 08:23
3
int getc ( FILE * stream );

Returns the character currently pointed to by the internal file position indicator of the specified stream. On success, the character read is returned (promoted to an int value). If ch is defined as int, everything works fine, but if ch is defined as char, the value returned from getc() is converted back to char, so the byte 0xFF can no longer be told apart from EOF.

This is what causes the corruption in the data and the loss in size.
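The loss in size can be illustrated with a minimal sketch that counts how many bytes each variant of the loop reads before stopping. Both function names are hypothetical, `signed char` stands in for a platform where plain char is signed, and EOF is assumed to be -1.

```c
#include <stdio.h>

/* Hypothetical helper: counts the bytes read before the loop stops
   when the result of getc() is squeezed into a signed char. */
long count_with_signed_char(FILE *fp)
{
    signed char ch;   /* assumption: models a compiler where plain char is signed */
    long n = 0;

    /* The conversion of 255 to signed char is assumed to yield -1. */
    while ((ch = (signed char)getc(fp)) != EOF)
        n++;          /* stops early at the first 0xFF byte */
    return n;
}

/* Hypothetical helper: the same loop with the result kept in an int. */
long count_with_int(FILE *fp)
{
    int ch;
    long n = 0;

    while ((ch = getc(fp)) != EOF)
        n++;          /* stops only at the real end of file */
    return n;
}
```

For a file whose fourth byte is 0xFF, the first function would report 3 no matter how long the file is, while the second reports the full length. A large executable almost certainly contains a 0xFF byte near its start, which fits the copy being only a few bytes long.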

Dayal rai