
I am trying to pull text off a Wiki page, store its formatting, and transfer it all to a PDF.

I know the iTextSharp library can help me put it in a PDF, but how would I go about pulling the text off the website while keeping the formatting?

– Roy
  • This doesn't sound like a little project for getting one's feet wet. But perhaps you are a freaking genius? – adv12 Jun 12 '14 at 19:46
  • There are at least 3 questions here - how to read text from a web site (aka how to use `WebClient`), how to parse HTML (aka give me an HtmlAgilityPack how-to), and how to create a PDF from HTML. Please make sure to research what has already been asked, and ask each new question separately. – Alexei Levenkov Jun 12 '14 at 19:49
  • Side note: there is no need to add "thank you" or "new here/with C#" to your post. The level of knowledge/care is quite obvious in most cases from the question anyway. – Alexei Levenkov Jun 12 '14 at 19:50

2 Answers


Not so familiar with C#, but my experience might help a bit. I use Perl to write scripts on a UNIX server, and I have my PHP and JS files hosted in the htdocs folder. In my PHP/JS code I make a shell-exec call to run my .pl file:

// run the Perl script, redirecting stderr to stdout and capturing its output lines
$command = "/mt_path/my_file_name.pl 2>&1";
exec($command, $exec_output_lines);

Now, you can have a program on your UNIX server that converts text to PDF. Simply call that program and pass it the text on the command line, save the resulting file temporarily, give the user the temporary URL to it, and then delete the file.
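
In C# terms the same idea would look roughly like the sketch below; the converter name (text2pdf) and the paths are placeholders for whatever tool is actually installed on the server, not real programs:

// Rough sketch: shell out to a (hypothetical) text-to-PDF converter,
// hand back a temporary file, and let the caller delete it once served.
using System.Diagnostics;
using System.IO;

class TextToPdf {
  public static string Convert(string textFilePath) {
    // "text2pdf" is a placeholder for the converter installed on the server.
    var pdfPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".pdf");
    var psi = new ProcessStartInfo("text2pdf", "\"" + textFilePath + "\" \"" + pdfPath + "\"") {
      UseShellExecute = false
    };
    using (var process = Process.Start(psi)) {
      process.WaitForExit();
    }
    return pdfPath; // serve this to the user, then delete it
  }
}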

Hope it gives you a start...

– KingsInnerSoul

If you are looking for the super easy/free way to do this, check out wkhtmltopdf.org

You can run it from the System.Diagnostics.Process class:

System.Diagnostics.Process.Start("wkhtmltopdf.exe", "http://www.google.com google.pdf");
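
If you want to wait for the conversion to finish and control the output path, a slightly fuller sketch (assuming wkhtmltopdf.exe is on the PATH) would be:

// Run wkhtmltopdf and block until the PDF has been written.
var psi = new System.Diagnostics.ProcessStartInfo {
  FileName = "wkhtmltopdf.exe",                    // assumes the binary is on the PATH
  Arguments = "http://www.google.com google.pdf",  // input URL, output file
  UseShellExecute = false
};
using (var process = System.Diagnostics.Process.Start(psi)) {
  process.WaitForExit();
}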

If you want to learn to do it yourself, it's super hard. Start by downloading the HTML using System.Net.WebClient:

string html;
using (var client = new System.Net.WebClient()) {
  html = client.DownloadString("http://www.google.com");
}

Then use an HTML parser like HtmlAgilityPack to find all of the CSS and images. (Don't use regex to parse HTML.)

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var cssNodes = doc.DocumentNode.SelectNodes("//link[@rel='stylesheet']");
var imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");

Download those files, then implement an HTML renderer (you know, like WebKit). Then, oh crap, I forgot: run the JavaScript (with your own JavaScript runtime, like V8) in case it modifies something in the DOM or CSS.
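
A rough sketch of just that "download those files" step, reusing the imgNodes list from above and resolving relative URLs against the page address (nothing here tackles the rendering part):

var baseUri = new System.Uri("http://www.google.com");
using (var client = new System.Net.WebClient()) {
  // SelectNodes returns null when nothing matched, so guard first.
  if (imgNodes != null) {
    foreach (var img in imgNodes) {
      var src = img.GetAttributeValue("src", "");
      if (src.Length == 0) continue;
      // Most references are relative, so resolve them against the page URL.
      var absoluteUri = new System.Uri(baseUri, src);
      var bytes = client.DownloadData(absoluteUri);
      // ... save bytes somewhere your renderer can find them
    }
  }
}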

Then, take that rendered HTML page and write a PDF renderer. Which is also hard. There's a hundred companies that don't do it well...

Or... just use wkhtmltopdf. Or EssentialObjects, or Aspose. All are good solutions.

– Zachary Yates