Downloading pdf file using WebRequests

Question

I'm trying to download a number of pdf files automagically given a list of urls.

Here's the code I have:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

request.Method = "GET";

var encoding = new UTF8Encoding();

request.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-gb,en;q=0.5");
request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate");

request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0";

HttpWebResponse resp = (HttpWebResponse)request.GetResponse();

BinaryReader reader = new BinaryReader(resp.GetResponseStream());

FileStream stream = new FileStream("output/" + date.ToString("yyyy-MM-dd") + ".pdf",FileMode.Create);

BinaryWriter writer = new BinaryWriter(stream);

while (reader.PeekChar() != -1)
{
    writer.Write(reader.Read());
}
writer.Flush();
writer.Close();

So, I know the first part works. I originally read the response using a TextReader, but that gave me corrupted PDF files (since PDFs are binary files).

Right now if I run it, reader.PeekChar() is always -1 and nothing happens - I get an empty file.

While debugging it, I noticed that reader.Read() was actually giving different numbers when I was invoking it - so maybe Peek is broken.

So I tried something very dirty:

try
{
    while (true)
    {
        writer.Write(reader.Read());
    }
}
catch
{
}
writer.Flush();
writer.Close();

Now I'm getting a very tiny file with some garbage in it, but it's still not what I'm looking for.

Can anyone point me in the right direction?

Additional Information:

The response headers don't suggest it's compressed or anything else.

HTTP/1.1 200 OK
Content-Type: application/pdf
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Fri, 10 Aug 2012 11:15:48 GMT
Content-Length: 109809

score 23 · Accepted Answer · answered Aug 10 '12 at 12:11


Skip the BinaryReader and BinaryWriter and just copy the input stream to the output FileStream. Briefly:

var fileName = "output/" + date.ToString("yyyy-MM-dd") + ".pdf";
using (var stream = File.Create(fileName))
  resp.GetResponseStream().CopyTo(stream);
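
For completeness, here is a fuller sketch of the same stream-copy approach that keeps the custom headers from the question. The names `PdfDownloader`, `CreateRequest` and `Download` are made up for illustration. One swap is an assumption rather than something from the question: instead of adding `Accept-Encoding` by hand, it sets `AutomaticDecompression`, which asks the framework to negotiate gzip/deflate and decode the body before it reaches the file.

```csharp
using System;
using System.IO;
using System.Net;

public static class PdfDownloader
{
    // Builds a GET request with the browser-like headers used in the question.
    public static HttpWebRequest CreateRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0";
        request.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-gb,en;q=0.5");
        // Assumption: let the framework advertise and undo gzip/deflate
        // instead of setting Accept-Encoding manually.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }

    // Copies the raw response bytes straight into the target file.
    public static void Download(string url, string fileName)
    {
        var request = CreateRequest(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var input = response.GetResponseStream())
        using (var output = File.Create(fileName))
        {
            input.CopyTo(output);
        }
    }
}
```

Usage would look like `PdfDownloader.Download(url, "output/" + date.ToString("yyyy-MM-dd") + ".pdf");`, with `url` and `date` coming from the surrounding loop in the question.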


Martin Liversage



I wonder if there is a way to get this into a byte array instead of sending it to the file system? – MetaGuru Aug 24 '15 at 20:12

@ioSamurai: Replace `File.Create(fileName)` with `new MemoryStream()` and then at the end of the `using` block retrieve the bytes: `var bytes = stream.ToArray()`. A `MemoryStream` does not use any unmanaged resources so you can also drop the `using` block entirely. – Martin Liversage Aug 24 '15 at 20:33
@MartinLiversage hmm I have tried this a few times and while I do get a byte stream, when I ultimately write it to disk the pdf file is corrupt... however making the same request from the browser (I am using WebRequest in code) gives the PDF file fine. This may actually be some strange behavior related to how Report Server serves up PDF responses to web requests... – MetaGuru Aug 24 '15 at 20:40
@ioSamurai: I am pretty sure that the few lines of code I have provided do not corrupt a PDF file, and I would be surprised if Report Server has a "strange behavior". To troubleshoot, compare the first few bytes and the length of the file across three sources: your own code, the stream in transit as seen by a tool like Fiddler, and the file retrieved using a web browser. – Martin Liversage Aug 24 '15 at 20:54
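
The two suggestions in these comments (reading into a byte array, then checking the first bytes and the length) could be sketched roughly as below. `PdfBytes`, `ReadAllBytes` and `LooksLikePdf` are hypothetical helper names, and the check only relies on the fact that a PDF file normally starts with the ASCII marker `%PDF-`; it says nothing about whether the rest of the file is valid.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfBytes
{
    // Copies any readable stream (for example a response stream) into a byte array.
    public static byte[] ReadAllBytes(Stream input)
    {
        using (var buffer = new MemoryStream())
        {
            input.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Returns true when the data begins with the "%PDF-" marker.
    public static bool LooksLikePdf(byte[] bytes)
    {
        byte[] marker = Encoding.ASCII.GetBytes("%PDF-");
        if (bytes == null || bytes.Length < marker.Length)
        {
            return false;
        }
        for (int i = 0; i < marker.Length; i++)
        {
            if (bytes[i] != marker[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

With the response from the question this would be `var bytes = PdfBytes.ReadAllBytes(resp.GetResponseStream());`, after which `bytes.Length` can be compared with the `Content-Length` header and with the size of the file a browser saves.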

score 10 · Answer 2 · answered Aug 10 '12 at 12:12


Why not use the WebClient class?

using (WebClient webClient = new WebClient())
{
    webClient.DownloadFile("url", "filePath");
}


Sergey Vyacheslavovich Brunov


I needed to be able to change the request headers. – Aabela Aug 10 '12 at 12:17

@Aabela, yeah, please take a look at [WebClient.Headers Property](http://msdn.microsoft.com/en-us/library/system.net.webclient.headers.aspx). – Sergey Vyacheslavovich Brunov Aug 10 '12 at 12:19
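
As a rough illustration of that suggestion, request headers can be set on `WebClient.Headers` before calling `DownloadFile`. The class and method names below are invented for the example, and the header values are simply the ones from the question.

```csharp
using System;
using System.Net;

public static class WebClientDownload
{
    // Creates a WebClient that sends the same headers as the question's request.
    public static WebClient CreateClient()
    {
        var client = new WebClient();
        client.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-gb,en;q=0.5");
        client.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0");
        return client;
    }

    // Downloads the resource at url to filePath using the configured headers.
    public static void Download(string url, string filePath)
    {
        using (var client = CreateClient())
        {
            client.DownloadFile(url, filePath);
        }
    }
}
```

Setting the headers right before each download is the cautious choice if one client is reused for several files.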

score 2 · Answer 3 · answered Aug 10 '12 at 12:12


Your question asks about WebClient, but your code uses raw HTTP requests and responses.

Why don't you actually use System.Net.WebClient?

using(System.Net.WebClient wc = new WebClient()) 
{
    wc.DownloadFile("http://www.site.com/file.pdf",  "C:\\Temp\\File.pdf");
}


Eoin Campbell


Sorry, fixed original question. The reason I went for raw HTTP requests/response is because I need to modify the headers myself. – Aabela Aug 10 '12 at 12:16
yep. it does that too. just saw your comment below. live and learn :-) – Eoin Campbell Aug 10 '12 at 12:27

score 0 · Answer 4 · edited Oct 09 '20 at 20:04

        // Requires: using System; using System.ComponentModel; using System.Net; using System.Windows.Forms;
        private void Form1_Load(object sender, EventArgs e)
        {
            WebClient webClient = new WebClient();
            webClient.DownloadFileCompleted += new AsyncCompletedEventHandler(Completed);
            webClient.DownloadProgressChanged += new DownloadProgressChangedEventHandler(ProgressChanged);
            // Name the file after the current tick count so repeated downloads do not overwrite each other.
            webClient.DownloadFileAsync(new Uri("https://www.colorado.gov/pacific/sites/default/files/Income1.pdf"), @"output/" + DateTime.Now.Ticks + ".pdf");
        }

        private void ProgressChanged(object sender, DownloadProgressChangedEventArgs e)
        {
            // progressBar is assumed to be a ProgressBar control on the form.
            progressBar.Value = e.ProgressPercentage;
        }

        private void Completed(object sender, AsyncCompletedEventArgs e)
        {
            MessageBox.Show("Download completed!");
        }
