8

I need to turn HTML into equivalent Markdown-structured text.

Note: I am looking for a quick and clear way of doing this with PHP and Python.

As I am programming in PHP, some people recommend Markdownify for the job, but unfortunately the code is no longer being updated and in fact it does not work. At sourceforge.net/projects/markdownify there is a "NOTE: unsupported - do you want to maintain this project? contact me! Markdownify is a HTML to Markdown converter written in PHP. See it as the successor to html2text.php since it has better design, better performance and less corner cases."

From what I could discover, I have only two good choices:

  • Python: Aaron Swartz's html2text.py

  • Ruby: Singpolyma's html2markdown.rb, based on Nokogiri

So, from PHP, I need to pass the HTML code, call the Ruby/Python script, and receive the output back.

(By the way, someone asked a similar question here ("how to call ruby script from php?"), but it has no practical information for my case.)

Following the Tin Man's tip (below), I got to this:

PHP code:

$t='<p><b>Hello</b><i>world!</i></p>';
$scaped=preg_quote($t,"/");
$program='python html2md.py';

//exec($program.' '.$scaped,$n); print_r($n); exit; //Works!!!

$input=$t;

$descriptorspec=array(
   array('pipe','r'),//stdin is a pipe that the child will read from
   array('pipe','w'),//stdout is a pipe that the child will write to
   array('file','./error-output.txt','a')//stderr is a file to write to
);

$process=proc_open($program,$descriptorspec,$pipes);

if(is_resource($process)){
    fwrite($pipes[0],$input);
    fclose($pipes[0]);
    $r=stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    $return_value=proc_close($process);
    echo "command returned $return_value\n";
    print_r($pipes);
    print_r($r);
}

Python code:

#! /usr/bin/env python
import html2text
import sys
print html2text.html2text(sys.argv[1])
#print "Hi!" #works!!!

With the above I am getting this:

command returned 1 Array ( [0] => Resource id #17 [1] => Resource id #18 )

And the "error-output.txt" file says:

Traceback (most recent call last):
  File "html2md.py", line 5, in <module>
    print html2text.html2text(sys.argv[1])
IndexError: list index out of range

Any ideas???


Ruby code (still being analysed):

#!/usr/bin/env ruby
require_relative 'html2markdown'
puts HTML2Markdown.new("<h1>#{ ARGF.read }</h1>").to_s

Just for the record, I had tried PHP's simpler "exec()" before, but I ran into problems with some special characters that are very common in HTML.

PHP code:

echo exec('./hi.rb');
echo exec('./hi.py');

Ruby code:

#!/usr/bin/ruby
puts "Hello World!"

Python code:

#!/usr/bin/python
import sys
print sys.argv[1]

Both work fine. But when the string is a bit more complicated:

$h='<p><b>Hello</b><i>world!</i></p>';
echo exec("python hi.py $h");

It did not work at all.

That's because the HTML string needed to have its special characters escaped. I got it working with this:

$t='<p><b>Hello</b><i>world!</i></p>';
$scaped=preg_quote($t,"/");

Now it works like I said here.

I am running: Fedora 14, Ruby 1.8.7, Python 2.7, Perl 5.12.2, PHP 5.3.4, nginx 0.8.53

Roger
  • possible duplicate of [WMD markdown editor - HTML to Markdown conversion](http://stackoverflow.com/questions/1196672/wmd-markdown-editor-html-to-markdown-conversion) suggests [Markdownify](http://milianw.de/projects/markdownify/index.php) to convert from HTML to Markdown with PHP. – Gordon Jan 06 '11 at 21:59
  • Well, they are discussing different things in "WMD markdown editor - HTML to Markdown conversion", although they are in fact trying to turn HTML into Markdown. Note also that that topic is still unsolved and that there isn't any good PHP program that does the job. "Markdownify" is mentioned, but the project was actually abandoned by the author and the code is not working. – Roger Jan 07 '11 at 02:35
  • Trying to pass the string to be changed on the command-line is a very fragile solution and can break easily. – the Tin Man Jan 07 '11 at 02:48
  • If there is any better way to do this using PHP's "exec()", then I agree with you, it is not a solution. – Roger Jan 07 '11 at 03:34

5 Answers

12

Have PHP open the Ruby or Python script via proc_open, piping the HTML into STDIN in the script. The Ruby/Python script reads and processes the data and returns it via STDOUT back to the PHP script, then exits. This is a common way of doing things via popen-like functionality in Perl, Ruby or Python and is nice because it gives you access to STDERR in case something blows chunks and doesn't require temp files, but it's a bit more complex.

Alternate ways of doing it could be writing the data from PHP to a temporary file, then using system, exec, or something similar to call the Ruby/Python script to open and process it, and print the output using their STDOUT.
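The temp-file alternative could look roughly like the sketch below. It is written in Python, which stands in for the PHP caller, and it rests on assumptions: the child program is an inline one-liner that just upper-cases the file it is given (a placeholder for a real converter script), and the helper name `convert_via_tempfile` is made up for illustration.

```python
import os
import subprocess
import sys
import tempfile

# Placeholder child program (assumption): reads the file named in argv[1]
# and prints it upper-cased. A real converter script would take its place.
CHILD = "import sys; print(open(sys.argv[1]).read().upper(), end='')"


def convert_via_tempfile(text):
    """Write text to a temp file, let the child process it, return its stdout."""
    fd, path = tempfile.mkstemp(suffix=".html", text=True)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        result = subprocess.run(
            [sys.executable, "-c", CHILD, path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    finally:
        os.remove(path)  # clean up even if the child fails


if __name__ == "__main__":
    print(convert_via_tempfile("<p>hello</p>"))
```

The cost compared with pipes is the extra file on disk and the cleanup step, which is why the `finally` block is there.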

EDIT:

See @Jonke's answer for "Best practices with STDIN in Ruby?" for examples of how simple it is to read STDIN and write to STDOUT with Ruby. "How do you read from stdin in python" has some good samples for that language.
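On the Python side, a STDIN-to-STDOUT filter can be quite small. The sketch below is only an illustration: `to_markdown` is a placeholder that strips whitespace, and `run_filter` is a made-up helper name. With the html2text module from the question installed, the placeholder body would presumably become `return html2text.html2text(html)`.

```python
import io
import sys


def to_markdown(html):
    # Placeholder transformation (assumption). With html2text available this
    # would be: return html2text.html2text(html)
    return html.strip()


def run_filter(stdin, stdout):
    """Read everything from stdin, transform it, write the result to stdout."""
    stdout.write(to_markdown(stdin.read()) + "\n")


# A real script would end with: run_filter(sys.stdin, sys.stdout)
# Here in-memory streams stand in for the pipes so the sketch runs anywhere.
if __name__ == "__main__":
    out = io.StringIO()
    run_filter(io.StringIO("  <p>hi</p>\n"), out)
    print(out.getvalue(), end="")
```

The important point is that the script reads from `sys.stdin` rather than `sys.argv`, so it works with whatever the calling process writes into the pipe.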

This is a simple example showing how to call a Ruby script, passing a string to it via PHP's STDIN pipe, and reading the Ruby script's STDOUT:

Save this as "test.php":

<?php
$descriptorspec = array(
   0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
   1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
   2 => array("file", "./error-output.txt", "a") // stderr is a file to write to
);
$process = proc_open('ruby ./test.rb', $descriptorspec, $pipes);

if (is_resource($process)) {
    // $pipes now looks like this:
    // 0 => writeable handle connected to child stdin
    // 1 => readable handle connected to child stdout
    // Any error output will be appended to ./error-output.txt

    fwrite($pipes[0], 'hello world');
    fclose($pipes[0]);

    echo stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // It is important that you close any pipes before calling
    // proc_close in order to avoid a deadlock
    $return_value = proc_close($process);

    echo "command returned $return_value\n";
}
?>

Save this as "test.rb":

#!/usr/bin/env ruby

puts "<b>#{ ARGF.read }</b>"

Running the PHP script gives:

Greg:Desktop greg$ php test.php 
<b>hello world</b>
command returned 0

The PHP script is opening the Ruby interpreter which opens the Ruby script. PHP then sends "hello world" to it. Ruby wraps the received text in bold tags, and outputs it, which is captured by PHP, and then output. There are no temp files, nothing passed on the command-line, you could pass a LOT of data if need-be, and it would be pretty fast. Python or Perl could easily be used instead of Ruby.
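For comparison, here is the same round trip sketched with Python as the parent process. It is not part of the PHP solution; it only shows the same pattern (write to the child's STDIN, read its STDOUT, check the exit code). The inline `CHILD` program is an assumed stand-in for test.rb, and `pipe_through_child` is a hypothetical helper name.

```python
import subprocess
import sys

# Inline stand-in for test.rb (assumption): wrap whatever arrives on stdin
# in <b> tags and write it to stdout.
CHILD = "import sys; sys.stdout.write('<b>' + sys.stdin.read() + '</b>')"


def pipe_through_child(text):
    """Send text to the child's stdin and return (stdout, exit code)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=text, capture_output=True, text=True,
    )
    return result.stdout, result.returncode


if __name__ == "__main__":
    output, code = pipe_through_child("hello world")
    print(output)
    print("command returned", code)
```

Nothing travels on the command line here either, so the shell never gets a chance to interpret characters such as `<`, `>` or `!`.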

EDIT:

If you have:

HTML2Markdown.new('<h1>HTMLcode</h1>').to_s

as sample code, then you could begin developing a Ruby solution with:

#!/usr/bin/env ruby

require_relative 'html2markdown'

puts HTML2Markdown.new("<h1>#{ ARGF.read }</h1>").to_s

assuming you've already downloaded the HTML2Markdown code and have it in the current directory and are running Ruby 1.9.2.

the Tin Man
  • Thank you very much. Would you have an example to give me please? – Roger Jan 07 '11 at 01:35
  • @Roger, the PHP links for "proc_open", "system" and "exec" have example code at the bottom of the pages. See the edit in my answer for examples for Ruby and PHP. – the Tin Man Jan 07 '11 at 03:12
  • I executed your code and the output was: "command returned 1". It looks like "proc_open()" gives us much more control over the data, although, as I have never used it, it seems pretty confusing at first. I'm still picturing how to make Ruby (or Python) process the HTML input. – Roger Jan 07 '11 at 03:57
  • The output says "`command returned 1`" because something wasn't done correctly. Try running `echo 'foo' | ruby test.rb` from the same directory where you saved the "test.rb" file. You should get "`foo`". – the Tin Man Jan 07 '11 at 04:02
  • Sure, I got it: "hello world command returned 0". Now picturing how to make it happen inside Ruby or Python. – Roger Jan 07 '11 at 04:17
5

In Python, have PHP pass the var as a command line argument, get it from sys.argv (the list of command line arguments passed to Python), and then have Python print the output, which PHP then echoes. Example:

#!/usr/bin/python
import sys

print "Hello ", sys.argv[1] # 2nd element, since the first is the script name

PHP:

<?php
echo exec('python script.py Rafe');
?>

The procedure should be basically the same in Ruby.
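Passing data as an argument gets fragile once the value contains shell metacharacters, as HTML does. The sketch below uses Python as the caller to show two ways around that; the inline `CHILD` program is an assumed stand-in for script.py and `call_with_argument` is a made-up name. Handing the arguments over as a list skips the shell altogether, and `shlex.quote` covers the case where a shell command string cannot be avoided.

```python
import shlex
import subprocess
import sys

# Inline stand-in for script.py (assumption): echo the first argument.
CHILD = "import sys; sys.stdout.write(sys.argv[1])"


def call_with_argument(value):
    """Pass value as a single argv entry, bypassing the shell entirely."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, value],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    html = "<p><b>Hello</b><i>world!</i></p>"
    print(call_with_argument(html))
    # If a shell command string is unavoidable, quote the value first.
    print(shlex.quote(html))
```

On the PHP side, `escapeshellarg()` plays the role that `shlex.quote` plays here. Even so, command-line arguments have length limits, so piping through STDIN is likely the sturdier choice for whole documents.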

Rafe Kettler
  • Rafe, please, I have added a comment to your reply above. – Roger Jan 07 '11 at 02:37
  • @Roger I see what you've done, and I'm not sure what could be producing the error. What's the output? – Rafe Kettler Jan 07 '11 at 03:10
  • @Roger I've taken a look at it and the problem is the ! character; this has special significance to the bash shell. So what you can do is escape special characters, or pick a more robust solution (I'm sure that there are markdown modules for PHP) – Rafe Kettler Jan 07 '11 at 03:17
  • As I said above, if there is any better way to do this (pass the html text to other program) using PHP's "exec()", then I agree with you, it is not a solution. By the way, I am looking for a way to transform HTML into Markdown and not the opposite. – Roger Jan 07 '11 at 03:36
  • @Roger you should look into Markdownify (http://milianw.de/projects/markdownify/). You can pass text around from PHP to Python and back, but it would be way easier to just use a PHP module. – Rafe Kettler Jan 07 '11 at 03:38
  • As I said, it is broken. At http://sourceforge.net/projects/markdownify/ there is a "NOTE: unsupported - do you want to maintain this project? contact me! Markdownify is a HTML to Markdown converter written in PHP. See it as the successor to `html2text.php` since it has better design, better performance and less corner cases." – Roger Jan 07 '11 at 04:12
  • Rafe, would you mind to show me how the code in Python would be? All I could achieve up to now is: `import html2text; import sys; for line in sys.stdin: t = line; print html2text(html)` – Roger Jan 07 '11 at 13:18
2

Use a variable in the Ruby code, and pass it in as an argument to the Ruby script from the PHP code. Then, have the Ruby script return the processed code into stdout which PHP can read.

Daniel Walker
0

Another, rather odd, approach is the one I used.

Php file -> output.txt
ruby file -> read from output.txt
Ruby file-> result.txt
Php file -> read from result.txt

Simply add exec('rubyfile.rb'); to the PHP file.

Not recommended but this will work for sure.

demonplus
Akif Hazarvi
0

I think your question is wrong. Your problem is how to convert from HTML to Markdown. Am I right?

Try this http://milianw.de/projects/markdownify/ I think it could help you =)

Evaldo Junior
  • As I said, it is broken. At sourceforge.net/projects/markdownify there is a "NOTE: unsupported - do you want to maintain this project? contact me! Markdownify is a HTML to Markdown converter written in PHP. See it as the successor to html2text.php since it has better design, better performance and less corner cases." – Roger Jan 08 '11 at 13:44
  • Evaldo, this solved my problem: http://stackoverflow.com/questions/4686842/how-to-stdin-and-stdout-with-php-and-python-to-use-html2text-and-get-a-markdown-f – Roger Jan 14 '11 at 14:28