How can I write a Perl script that connects to the various Stanford NLP applications?

Question

How can I write a Perl script that connects to the various Stanford NLP applications?

I have started both the Stanford part-of-speech and named-entity applications as servers, and when I send them requests from the command line, I get the sorts of responses I expect. Here is an example command-line invocation:

cat file.txt | nc localhost 8081

I now want to write both a Perl-based command-line script and a Perl-based CGI script to do the same work, but I am having trouble getting back the full response. Here are the most salient lines of my script(s):

# initialize
my $text     = '';
my $response = '';

# get the text to process and normalize it for xml
$text =  &slurp( $file );
$text =~ s/\&/\&amp;/g;
$text =~ s/</\&lt;/g;
$text =~ s/>/\&gt;/g;
$text =~ s/\W+/ /g;

# open a connection, send the data, and get the response
my $socket = new IO::Socket::INET( PeerHost => HOST, PeerPort => PORT, Proto => PROTOCOL );
if ( ! $socket ) { die "Cannot connect to the server $!\n" }
$socket->send( "$text\n" );
$socket->recv( $response, 10240000 );
$socket->close();

This works fine for smaller files, but often fails for larger ones, no matter how much I increase the buffer size (10240000). Moreover, the amount of data returned by the server (or, more specifically, received by the client) is never the same from one run to the next: sometimes the response is bigger, sometimes smaller.

When does recv know to stop... receiving?

What am I doing wrong?

You'll need to implement this more robustly if you're dealing with non-trivial amounts of data. Most TCP sockets have a write buffer limit and if you blow it you'll run the risk of data being ignored or your program crashing out on an error condition. Normally you use [`select`](http://perldoc.perl.org/functions/select.html) to test if your socket can be written to, as well as if any data is available to read. This goes inside a loop that polls repeatedly until you're done sending and/or reading. — tadman, Nov 27 '17 at 21:11
Interesting! How might I "poll repeatedly"? I'm pretty sure the server is getting the whole of the request. — ericleasemorgan, Nov 27 '17 at 21:26
That's what `select` does. It tells you when there's data to read or buffer space to write. If you put it inside a loop you're half way there. — tadman, Nov 27 '17 at 21:26
Some questions: do you want to submit all of the text of a file for annotation as if it were a single document, or do you want to submit each line as a separate sentence? Also, what do you mean by a "small" file vs. a "large" file? — StanfordNLPHelp, Nov 27 '17 at 21:52
And just to be clear, you're using the Stanford CoreNLP server: https://stanfordnlp.github.io/CoreNLP/corenlp-server.html — StanfordNLPHelp, Nov 27 '17 at 21:54
See [IO::Select](http://perldoc.perl.org/IO/Select.html) for easier `select`. There's a full example there as well. — zdim, Nov 28 '17 at 03:02
I do not know much about Perl. But by googling around it looked like people use LWP for this kind of thing. For instance, review this answer: https://stackoverflow.com/questions/4199266/how-can-i-make-a-json-post-request-with-lwp — StanfordNLPHelp, Nov 28 '17 at 03:28
StanfordNLPHelp, yes, I want to submit an entire document, such as a whole book: about 0.5 MB of data. And no, I'm not really using CoreNLP, but instead the individual NER and POS jars. — ericleasemorgan, Nov 28 '17 at 15:08
Yes, thank you. I am aware of the Perl interface to CoreNLP, but it seems like a lot of overhead, considering my implementation below. Thanks anyway. — ericleasemorgan, Nov 28 '17 at 15:09
IO::Select looks interesting. I will investigate. Thank you. — ericleasemorgan, Nov 28 '17 at 15:14
I strongly recommend not sending 0.5 MB of data in a single call to the server. The typical case is to send a document of about 2.3K. You should divide the text into document-sized blocks and send each block separately. — StanfordNLPHelp, Nov 29 '17 at 09:11
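
The chunking suggested in the last comment might be sketched as follows. This is only an illustration: `split_into_blocks` is a hypothetical helper, not part of any Stanford tool, and it breaks at whitespace only. Splitting at sentence or paragraph boundaries would probably suit a tagger better.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into blocks of at most $max_chars
# characters, breaking only at whitespace so that no word is cut in half.
# A single word longer than $max_chars becomes a block of its own.
sub split_into_blocks {
    my ( $text, $max_chars ) = @_;
    my @blocks;
    my $current = '';
    foreach my $word ( split ' ', $text ) {

        # start a new block if adding this word would exceed the limit
        if ( length( $current ) && length( $current ) + 1 + length( $word ) > $max_chars ) {
            push @blocks, $current;
            $current = '';
        }
        $current .= ( length( $current ) ? ' ' : '' ) . $word;
    }
    push @blocks, $current if length( $current );
    return @blocks;
}

# small demonstration with an artificially low limit
my @blocks = split_into_blocks( 'the quick brown fox jumps over the lazy dog', 15 );
print "$_\n" for @blocks;
```

With real input, the limit would be closer to the 2.3K mentioned above, and each block would be sent to the server over its own connection.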

score 0 · Answer 1 · answered Nov 28 '17 at 15:06

I believe I have resolved my issue; tadman was correct. I needed to 1) work more robustly with "non-trivial amounts of data", and 2) implement some sort of waiting technique using select. Here is my solution, which seems to work quite well:

# robustly read from a socket; see http://www.perlmonks.org/?node_id=54146
use IO::Socket::INET;

# initialize
my $text     = &slurp( $cgi->tmpFileName( $input ) );
my $response = '';
my $timeout  = 20;
my $buffer   = 10240;
my $host     = 'localhost';
my $port     = '8081';
my $protocol = 'tcp';

# open the socket and write the text
my $socket = IO::Socket::INET->new( PeerAddr => $host, PeerPort => $port, Proto => $protocol )
    or die "Cannot connect to $host:$port: $@\n";
$socket->write( "$text\n" );

# loop until the server closes the connection or the timeout expires
while ( 1 ) {

    # build the bit mask for select: set the bit at the index of the socket's file descriptor
    my $rbits = '';
    vec( $rbits, $socket->fileno, 1 ) = 1;

    # wait up to $timeout seconds for the socket to become readable
    if ( select( $rbits, undef, undef, $timeout ) > 0 && vec( $rbits, $socket->fileno, 1 ) ) {

        # read whatever is available and append it to the response;
        # sysread is used because buffered reads should not be mixed with select
        my $stream = '';
        my $result = $socket->sysread( $stream, $buffer );
        $response .= $stream if $result;

        # zero means end of file and undef means error; either way, stop
        last unless $result;

    }

    # timeout or select error
    else { last }

}

# done
print "$response\n";

I found the lower-level interface confusing. For example, I don't really understand vec or the parameters to select; they seem very C-like.
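
For what it is worth, `vec` sets one bit in a bit mask, at the position given by the socket's file descriptor number, and the four arguments to `select` are the read mask, the write mask, the exception mask, and a timeout in seconds. A single `recv` or `sysread` on a TCP socket returns as soon as some data is available, which is why the size of the response varied; the only reliable "stop" signal is a read that returns zero bytes, meaning the server has closed the connection. The IO::Select module mentioned in the comments hides the bit masks. Below is a minimal sketch of the same loop, assuming the server closes the connection when it is done. The name `read_until_eof`, the chunk size, and the payload are made up for illustration, and a local socket pair stands in for the Stanford server.

```perl
use strict;
use warnings;
use IO::Select;
use IO::Socket;
use Socket qw( AF_UNIX SOCK_STREAM PF_UNSPEC );

# Hypothetical helper: read from $socket until the peer closes the
# connection (end of file) or no data arrives within $timeout seconds.
sub read_until_eof {
    my ( $socket, $timeout, $chunk_size ) = @_;
    my $select   = IO::Select->new( $socket );
    my $response = '';

    # can_read() wraps select(): it returns the handles that are ready,
    # or an empty list when the timeout expires
    while ( $select->can_read( $timeout ) ) {
        my $bytes = sysread( $socket, my $chunk, $chunk_size );
        die "read failed: $!" unless defined $bytes;
        last if $bytes == 0;    # zero bytes means the peer closed the connection
        $response .= $chunk;
    }
    return $response;
}

# demonstration: a local socket pair stands in for the server (assumption)
my ( $client, $server ) = IO::Socket->socketpair( AF_UNIX, SOCK_STREAM, PF_UNSPEC )
    or die "socketpair failed: $!";
my $payload = "The/DT quick/JJ brown/JJ fox/NN\n" x 3;
syswrite( $server, $payload );
close $server;    # closing the writing end is what signals end of file

# a deliberately tiny chunk size forces several passes through the loop
my $response = read_until_eof( $client, 5, 16 );
print length( $response ), " bytes read\n";
```

With a real server, `$client` would be the IO::Socket::INET object, and the chunk size could be much larger; the loop still ends only on end of file, error, or timeout.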

In any event, I'm much further along than I used to be. Thank you.
