I am trying to pull text off a Wiki page, store its formatting, and transfer it all to a PDF.
I know the iTextSharp library can help me put it in a PDF, but how would I go about pulling the text off the website while keeping the formatting?
I'm not so familiar with running C scripts, but my experience might help a bit. I use Perl to write scripts on a UNIX server, and I host my PHP and JS files in the htdocs folder. From my PHP code, I shell out to run my .pl file:
$command = "/mt_path/my_file_name.pl 2>&1";
exec($command, $exec_output_lines);
Now, you can have a program on your UNIX server that converts text to PDF. Simply call that program and pass it the text on the command line. Then save the file temporarily, give the user a temporary URL to it, and delete the file afterwards.
Hope it gives you a start...
If you are looking for the super easy/free way to do this, check out wkhtmltopdf.org.
You can run it from the System.Diagnostics.Process class:
System.Diagnostics.Process.Start("wkhtmltopdf.exe", "http://www.google.com google.pdf");
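In practice you probably want to wait for the conversion to finish and check whether it worked before handing the PDF to anyone. Here is a rough sketch of that, not a definitive implementation. The class name WkHtmlToPdfRunner, the helper BuildArguments, and the timeout parameter are made up for illustration; it also assumes wkhtmltopdf is installed at the path you pass in and that an exit code of 0 means success.

```csharp
using System;
using System.Diagnostics;

public static class WkHtmlToPdfRunner
{
    // Builds the argument string "<url>" "<output>".
    // Quoting keeps URLs and paths containing spaces intact.
    public static string BuildArguments(string url, string outputPath)
    {
        return "\"" + url + "\" \"" + outputPath + "\"";
    }

    // Runs wkhtmltopdf, waits up to timeoutMs, and returns its exit code.
    // Assumption: exePath points at an installed wkhtmltopdf binary.
    public static int Convert(string exePath, string url, string outputPath, int timeoutMs)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = BuildArguments(url, outputPath),
            UseShellExecute = false, // launch the executable directly, not via the OS shell
            CreateNoWindow = true    // no console window pops up
        };

        using (var process = Process.Start(startInfo))
        {
            // WaitForExit returns false if the process is still running after the timeout.
            if (!process.WaitForExit(timeoutMs))
            {
                process.Kill();
                throw new TimeoutException("wkhtmltopdf did not finish in time.");
            }
            return process.ExitCode;
        }
    }
}
```

You would call it as WkHtmlToPdfRunner.Convert("wkhtmltopdf.exe", "http://www.google.com", "google.pdf", 60000) and treat a non-zero return value as a failed conversion.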
If you want to learn to do it yourself, it's super hard. Start by downloading the HTML using System.Net.WebClient:
// Declared outside the using block so the parsing step below can still see it.
string html;
using (var client = new System.Net.WebClient()) {
    html = client.DownloadString("http://www.google.com");
}
Then use an HTML parser such as HtmlAgilityPack to find all of the CSS and images. (Don't use regex to parse HTML.)
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// SelectNodes returns null when nothing matches, so check before iterating.
var cssNodes = doc.DocumentNode.SelectNodes("//link[@rel='stylesheet']");
var imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");
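One thing that tends to bite here: the href and src values on those nodes are often relative ("/style.css", "img/logo.png"), so they have to be resolved against the page URL before they can be fetched. A minimal sketch of that step, using only System.Uri; the class name ResourceUrls and the method Resolve are hypothetical names, and it assumes you have already pulled the attribute values out of the nodes as plain strings.

```csharp
using System;
using System.Collections.Generic;

public static class ResourceUrls
{
    // Resolves each raw href/src value against the page URL.
    // Empty values and values that cannot be resolved are skipped.
    public static List<Uri> Resolve(string pageUrl, IEnumerable<string> rawValues)
    {
        var baseUri = new Uri(pageUrl);
        var resolved = new List<Uri>();
        foreach (var raw in rawValues)
        {
            Uri absolute;
            if (!string.IsNullOrWhiteSpace(raw) && Uri.TryCreate(baseUri, raw, out absolute))
            {
                resolved.Add(absolute);
            }
        }
        return resolved;
    }
}
```

With HtmlAgilityPack, the raw values would come from something like node.GetAttributeValue("src", "") on each img node (and "href" on each link node). Each resulting Uri can then be passed to WebClient.DownloadFile.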
Download those files, then implement an HTML renderer (you know, like WebKit). Then, oh crap, I forgot: run the JavaScript (with your own JavaScript runtime, like V8) in case it modifies something in the DOM or CSS.
Then take that rendered HTML page and write a PDF renderer, which is also hard. There are a hundred companies that don't do it well...
Or... just use wkhtmltopdf. Or EssentialObjects, or Aspose. All are good solutions.