
I'm trying to use iText 7 with the iText7.PdfHtml conversion tool on big HTML files. Memory consumption goes over 1 GB and is never freed, and the conversion also takes a lot of time compared with other libraries, with high CPU usage. What is going on?

        public IActionResult Index()
        {
            byte[] data = null;

            var html = System.IO.File.ReadAllBytes(@"Input.html");
            using (var htmlStream = new MemoryStream(html))
            {
                using (var pdfStream = new MemoryStream())
                {
                    using (var pdfWriter = new PdfWriter(pdfStream))
                    {
                        using (var pdfDocument = new PdfDocument(pdfWriter))
                        {
                            using (var document = HtmlConverter.ConvertToDocument(htmlStream, pdfDocument, new ConverterProperties()))
                            {
                                pdfDocument.SetDefaultPageSize(iText.Kernel.Geom.PageSize.A4);
                            }
                        }
                    }
                    // ToArray, not GetBuffer: GetBuffer returns the whole internal
                    // buffer, including unused trailing capacity.
                    data = pdfStream.ToArray();
                    return new FileStreamResult(new MemoryStream(data), System.Net.Mime.MediaTypeNames.Application.Pdf);
                }
            }
        }

Using FileStreams

        public IActionResult Index()
        {
            var tempFileName = Path.GetTempFileName();
            using (var htmlStream = new FileStream(@"Input.html", FileMode.Open))
            using (var pdfStream = new FileStream(tempFileName, FileMode.Create))
            using (var pdfWriter = new PdfWriter(pdfStream))
            using (var pdfDocument = new PdfDocument(pdfWriter))
            using (var document = HtmlConverter.ConvertToDocument(htmlStream, pdfDocument, new ConverterProperties()))
                pdfDocument.SetDefaultPageSize(iText.Kernel.Geom.PageSize.A4);

            return new FileStreamResult(new FileStream(tempFileName, FileMode.Open), System.Net.Mime.MediaTypeNames.Application.Pdf);
        }
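One detail worth noting in the FileStream variant above: the temp file created by `Path.GetTempFileName()` is never deleted. A minimal sketch of one way to clean it up, assuming `FileOptions.DeleteOnClose` is acceptable for this scenario (the OS then removes the file once the response stream is disposed):

```csharp
// Re-open the temp file for the response; DeleteOnClose removes it from
// disk once ASP.NET Core finishes writing the response and disposes the
// stream.
var resultStream = new FileStream(
    tempFileName,
    FileMode.Open,
    FileAccess.Read,
    FileShare.Read,
    bufferSize: 4096,
    FileOptions.DeleteOnClose);
return new FileStreamResult(resultStream, System.Net.Mime.MediaTypeNames.Application.Pdf);
```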

[Screenshot: memory and CPU usage of the conversion process]

Big HTML file: https://www.dropbox.com/s/fqkvcnvsvp1mjz4/Input.zip?dl=0

Enner Pérez

2 Answers


It's been a long time, so maybe you have fixed it already. But just in case someone else is looking for something like this: the way we got our service's memory consumption under control was by using these PdfDocument methods: SetCloseWriter, SetCloseReader, and SetFlushUnusedObjects.

    public async Task<byte[]> ConvertHtmlAsync(Stream reader)
    {
        // wp (WriterProperties), BaseUriPath and CustomDefaultFontProvider
        // come from the surrounding service and are not shown here.
        byte[] buffer;
        PdfDocument pdfDoc = null;
        MemoryStream memStream = null;
        PdfWriter pdfWriter = null;
        try
        {
            ConverterProperties props = new ConverterProperties();
            FontProvider localFontProvider = new FontProvider(CustomDefaultFontProvider.GetFontSet());
            props.SetFontProvider(localFontProvider);
            props.SetBaseUri(BaseUriPath);
            using (memStream = new MemoryStream())
            {
                using (pdfWriter = new PdfWriter(memStream, wp))
                {
                    pdfWriter.SetCloseStream(true);
                    using (pdfDoc = new PdfDocument(pdfWriter))
                    {
                        pdfDoc.SetDefaultPageSize(PageSize.LETTER);
                        pdfDoc.SetCloseWriter(true);
                        pdfDoc.SetCloseReader(true);
                        pdfDoc.SetFlushUnusedObjects(true);
                        await Task.Run(() => HtmlConverter.ConvertToPdf(reader, pdfDoc, props));
                        pdfDoc.Close();
                    }
                }
                buffer = memStream.ToArray();
            }
        }
        finally
        {
            if (pdfDoc != null && !pdfDoc.IsClosed()) { pdfDoc.Close(); }
            pdfWriter?.Dispose();
            memStream?.Dispose();
            // Ask the runtime to compact the large object heap on the
            // next collection, then trigger one.
            GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
            GC.Collect();
        }

        return buffer;
    }
Uriel Fernandez

I have no experience with iText 7 pdfHTML, but there is a workaround I can recommend: kblok/puppeteer-sharp, which is based on Puppeteer and uses headless Chrome to generate a PDF file from the rendered HTML, as in the code below.

Generate PDF files

await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
// Dispose the browser so the headless Chrome process is shut down
using (var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = true
}))
{
    var page = await browser.NewPageAsync();
    await page.GoToAsync("http://www.google.com");
    await page.PdfAsync(outputFile);
}

For optimizing CPU and memory usage, you can refer to the SO thread "Limit chrome headless CPU and memory usage".
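As a rough illustration of the approach in that thread, Chrome command-line flags can be passed through PuppeteerSharp's `LaunchOptions.Args`. The specific flags and values below are assumptions for illustration, not tuned recommendations:

```csharp
// Sketch: pass resource-related Chrome flags at launch.
var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = true,
    Args = new[]
    {
        "--disable-dev-shm-usage",            // avoid /dev/shm exhaustion in containers
        "--js-flags=--max-old-space-size=256" // cap the V8 heap (in MB)
    }
});
```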

Note: if you want to deploy your app to the Azure cloud, be aware that due to the Azure Web App sandbox, Azure Web App for Windows cannot be used with GDI. Only an Azure VM, Web App for Linux, or Azure Container services can host an app that uses Puppeteer with headless Chrome.

Hope it helps.

Peter Pan
    That is not an answer to the question. OP's code already works, they are just asking for some performance tweaks. You're proposing that they do an entire architecture change. Also, these Chrome Headless PDF converters do a *visual* conversion, while pdfHTML does a *structural* conversion. I don't see how your Puppeteer solution produces a correctly *tagged* PDF file. Also, OP's product may not even be a web application, so Azure or any other cloud platform may be irrelevant. – Amedee Van Gasse Nov 14 '19 at 11:23
  • Thanks so much for your suggestions, but Amedee Van Gasse is right; my question is more about the performance of structural conversion of large HTML files. Anyway, I appreciate it. I tried your suggestion; unfortunately, it takes too much time to initialize Puppeteer. The PDF does get generated almost instantly, but just like iText 7 it uses a lot of RAM, around 800 MB, as I expected; it's Chromium after all. =) I invite you to download my HTML file and test it with your suggestions. – Enner Pérez Nov 14 '19 at 16:59