I need to convert HTML into equivalent Markdown-structured text.
Note: I am looking for a quick and clear way of doing this with PHP and Python.
Since I am programming in PHP, some people recommend Markdownify for the job, but unfortunately its code is no longer being updated and in fact it does not work. At sourceforge.net/projects/markdownify there is this notice: "NOTE: unsupported - do you want to maintain this project? contact me! Markdownify is a HTML to Markdown converter written in PHP. See it as the successor to html2text.php since it has better design, better performance and less corner cases."
From what I could discover, I have only two good choices:
Python: Aaron Swartz's html2text.py
Ruby: Singpolyma's html2markdown.rb, based on Nokogiri
So, from PHP, I need to pass the HTML code, call the Ruby/Python script, and receive the output back.
(By the way, someone asked a similar question here ("how to call ruby script from php?"), but it had no practical information for my case.)
Following the Tin Man's tip (below), I got to this:
PHP code:
$t = '<p><b>Hello</b><i>world!</i></p>';
$scaped = preg_quote($t, "/");
$program = 'python html2md.py';
//exec($program.' '.$scaped,$n); print_r($n); exit; //Works!!!
$input = $t;
$descriptorspec = array(
    array('pipe', 'r'), // stdin is a pipe that the child will read from
    array('pipe', 'w'), // stdout is a pipe that the child will write to
    array('file', './error-output.txt', 'a') // stderr is a file to write to
);
$process = proc_open($program, $descriptorspec, $pipes);
if (is_resource($process)) {
    fwrite($pipes[0], $input);
    fclose($pipes[0]);
    $r = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    $return_value = proc_close($process);
    echo "command returned $return_value\n";
    print_r($pipes);
    print_r($r);
}
Python code:
#! /usr/bin/env python
import html2text
import sys
print html2text.html2text(sys.argv[1])
#print "Hi!" #works!!!
With the above I am getting this:
command returned 1 Array ( [0] => Resource id #17 [1] => Resource id #18 )
And the "error-output.txt" file says:
Traceback (most recent call last):
  File "html2md.py", line 5, in <module>
    print html2text.html2text(sys.argv[1])
IndexError: list index out of range
Any ideas???
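The traceback points at the mismatch: the PHP code writes the HTML to the child's stdin and passes no command-line argument, so sys.argv[1] does not exist. Below is a minimal sketch of a script that reads from stdin instead. The helper name convert_stream is hypothetical, and the wiring to html2text.html2text (the function the script above already calls) is shown only as a comment, assuming that module is importable.

```python
import io
import sys


def convert_stream(src, dst, convert):
    """Read all of src, run it through convert, and write the result to dst."""
    dst.write(convert(src.read()))


# In html2md.py this would be wired up roughly as follows
# (assumes the html2text module is importable):
#
#   import html2text
#   convert_stream(sys.stdin, sys.stdout, html2text.html2text)

# Stand-in demonstration with an in-memory "stdin" and a trivial converter.
fake_stdin = io.StringIO("<b>Hello</b>")
fake_stdout = io.StringIO()
convert_stream(fake_stdin, fake_stdout, str.upper)
```

With a script like this, the proc_open() code above could stay as it is, since it already sends $input through the stdin pipe.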
Ruby code (still being analyzed):
#!/usr/bin/env ruby
require_relative 'html2markdown'
puts HTML2Markdown.new("<h1>#{ ARGF.read }</h1>").to_s
For the record, I first tried PHP's simplest option, exec(), but I ran into problems with some special characters that are very common in HTML.
PHP code:
echo exec('./hi.rb');
echo exec('./hi.py');
Ruby code:
#!/usr/bin/ruby
puts "Hello World!"
Python code:
#!/usr/bin/python
import sys
print sys.argv[1]
Both work fine. But when the string is a bit more complicated:
$h='<p><b>Hello</b><i>world!</i></p>';
echo exec("python hi.py $h");
It did not work at all.
That's because the HTML string needed to have its special characters escaped. I did it using this:
$t='<p><b>Hello</b><i>world!</i></p>';
$scaped=preg_quote($t,"/");
Now it works, as I said above.
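One caveat: preg_quote() is meant for regular expressions. It backslash-escapes characters such as <, >, ! and /, which the shell happens to accept, but it leaves spaces and quotes alone, so HTML with attributes would likely still break. Shell quoting is a separate job: PHP has escapeshellarg() for it, and Python has shlex.quote(). The sketch below shows the idea in Python; build_command is a hypothetical helper, and hi.py is the script name from above.

```python
import shlex


def build_command(script, html):
    """Build a shell command line that passes html as one safely quoted argument."""
    return "python " + shlex.quote(script) + " " + shlex.quote(html)


# HTML with a space, double quotes and a single quote, all of which
# would be misread by the shell if passed unquoted.
html = '<a href="x y">it\'s</a>'
command = build_command("hi.py", html)

# Splitting the command line the way a POSIX shell would gives back
# exactly three words, the last one being the untouched HTML.
words = shlex.split(command)
```

Passing the HTML through stdin avoids the quoting question entirely, which is why the pipe approach is usually the more robust of the two.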
I am running: Fedora 14, Ruby 1.8.7, Python 2.7, Perl 5.12.2, PHP 5.3.4, nginx 0.8.53