I'm trying to download a PDF file from the web, but the downloaded file is an HTML file (it cannot be opened with Adobe Reader). The request completes with status code 200.

    string tempFile = System.IO.Path.GetTempFileName();
    tempFile = System.IO.Path.ChangeExtension(tempFile, "pdf");

    HttpResponseMessage msg;
    using (HttpClient client = new HttpClient())
    {
        msg = await client.GetAsync("https://www.anaf.ro/StareD112/ObtineRecipisa?numefisier=217776607.pdf");

        if (msg.IsSuccessStatusCode)
        {
            using (var file = File.Create(tempFile))
            {
                var contentStream = await msg.Content.ReadAsStreamAsync();
                await contentStream.CopyToAsync(file);
                await file.FlushAsync();
            }
        }
    }

The downloaded file has the content:

<!DOCTYPE html><html><head><meta http-equiv="Pragma" content="no-cache"/><meta http-equiv="Expires" content="-1"/><meta http-equiv="CacheControl" content="no-cache"/><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><link rel="shortcut icon" href="data:;base64,iVBORw0KGgo="/><script>(function(){window["bobcmn"] = "111111101...})();</script><script type="text/javascript" src="/TSPD/08b919fd7aab2000e3f4522c0108677748c289d7a5e2904fcfef59fb022a19a5fd44f0a0664ad54a?type=10"></script><noscript>Please enable JavaScript to view the page content.<br/>Your support ID is: 10865965003120525066.</noscript>

Can you tell me what I'm not doing right?

Puiu .
  • *Can you tell me what I'm not doing right?* Nothing. The file is an HTML file, not a PDF; it's probably some kind of protection against data scraping. – Selvin Jul 24 '20 at 14:00
  • My guess is that you're not hitting the right URL, but there may not be a URL available that will simply stream the file to you. It looks as if the URL you are using leads to an HTML file. If, when you visit it in the browser, it streams a PDF to you, then it's the page itself that's connecting to the file server (or initiating the PDF generation process). – Ann L. Jul 24 '20 at 14:11
  • If you open the link from the example in a browser, you can download a PDF file that Adobe Acrobat Reader DC reads fine. The file my code downloads from that link is the one shown in the question. Until a day ago, the code downloaded the PDF file correctly. I assume something has changed on the server, but I can't know that. – Puiu . Jul 24 '20 at 19:23

1 Answer

The URL https://www.anaf.ro/StareD112/ObtineRecipisa?numefisier=217776607.pdf shows a captcha, depending on the request.

Your code downloads the captcha's HTML.
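
A status code of 200 therefore does not prove that the body is a PDF. One way to catch this case in code is to inspect the first bytes of the response before saving it, since a PDF file normally starts with the ASCII signature `%PDF-`. The following is only a minimal sketch under that assumption; the names `PdfDownload`, `LooksLikePdf` and `TryDownloadPdfAsync` are hypothetical and not part of the original code, and the check does not get around the captcha, it only avoids saving the challenge page as a `.pdf` file.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class PdfDownload
{
    // Assumption: a well-formed PDF begins with the ASCII bytes "%PDF-".
    private static readonly byte[] PdfSignature = Encoding.ASCII.GetBytes("%PDF-");

    // Returns true only if the content starts with the PDF signature.
    public static bool LooksLikePdf(byte[] content)
    {
        if (content == null || content.Length < PdfSignature.Length)
        {
            return false;
        }

        for (int i = 0; i < PdfSignature.Length; i++)
        {
            if (content[i] != PdfSignature[i])
            {
                return false;
            }
        }

        return true;
    }

    // Downloads the URL and writes the body to targetPath only when it looks like a PDF.
    // Returns false for non-success status codes and for bodies such as an HTML challenge page.
    public static async Task<bool> TryDownloadPdfAsync(HttpClient client, string url, string targetPath)
    {
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            if (!response.IsSuccessStatusCode)
            {
                return false;
            }

            byte[] body = await response.Content.ReadAsByteArrayAsync();
            if (!LooksLikePdf(body))
            {
                return false;
            }

            using (FileStream file = File.Create(targetPath))
            {
                await file.WriteAsync(body, 0, body.Length);
            }

            return true;
        }
    }
}
```

With the code from the question, `await PdfDownload.TryDownloadPdfAsync(client, url, tempFile)` would return `false` for the HTML page shown above instead of writing it to the temporary file.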

You need some technique to avoid being detected as a scraper. Try Selenium with a headless browser. For more information:

  1. Selenium + Chrome
    Running Selenium with Headless Chrome Webdriver
    Downloading with chrome headless and selenium

  2. Selenium + Phantom JS
    C# example of using PhantomJS webdriver ExecutePhantomJS to filter out images
    How to capture a file download using phantomJS

Vitaliy Shibaev
  • Strange thing! For me, no matter which browser I open that link with, the PDF document opens directly. I tried emptying the browser cache and using a freshly installed Windows, but that captcha never appears ... Maybe it appears for you because you are in another country? I tried another experiment: in a simple WinForms application I added a WebBrowser control and called Navigate with the link above. The PDF file appears correctly! Now I'm looking for a way to save the browser content to a PDF file. – Puiu . Jul 28 '20 at 09:10
  • By the way, I cannot reproduce this captcha today. I tried to load your URL from Tor Browser and got a third behavior: it redirects me to http://mentenanta.anaf.ro/. – Vitaliy Shibaev Jul 28 '20 at 10:24
  • Sometimes the site goes into maintenance. This is the website of our Ministry of Finance! I think the easiest way is to save the contents of the WebBrowser control in the C# application to a PDF file. I haven't found out how to do that yet! – Puiu . Jul 28 '20 at 13:36