
I'm building an application that takes in a user's tweets using the Twitter API, and one component of it performs sentiment extraction on the tweet texts. For development I'm using XAMPP, with the Apache HTTP Server as my local web server, and Eclipse for PHP as an IDE.

For the sentiment extraction I'm using the uClassify Sentiment Classifier. The classifier exposes an API that accepts a number of requests, and for each request it sends back XML from which the sentiment values can be parsed.

Now the application may process a large number of tweets at once (the maximum allowed is 3200). For example, if there are 3200 tweets, the system sends 3200 API calls to the classifier at once. Unfortunately, at that volume the system does not scale well; in fact XAMPP crashes after running with these calls for a short while. With a modest number of tweets (for example 500) the system works fine, so I am assuming the problem is the large number of API calls. It may help to note that uClassify allows a maximum of 5000 API calls per day, but since the maximum here is 3200 I am pretty sure I am not exceeding that limit.

This is pretty much my first time working on this kind of web development, so I may well be making a rookie mistake. I don't know what I could be doing wrong or where to start looking. Any advice or insight would help a lot!

EDIT: added source code to the question

Update index method

function updateIndex($timeline, $connection, $user_handle, $json_index, $most_recent) {
    // URL arrays for uClassify API calls
    $urls = [ ];
    $urls_id = [ ];

    // halt if no more new tweets are found
    $halt = false;
    // set to 1 to skip first tweet after 1st batch
    $j = 0;
    // count number of new tweets indexed
    $count = 0;
    while ( (count ( $timeline ) != 1 || $j == 0) && $halt == false ) {
        $no_of_tweets_in_batch = 0;
        $n = $j;
        while ( ($n < count ( $timeline )) && $halt == false ) {
            $tweet_id = $timeline [$n]->id_str;
            if ($tweet_id > $most_recent) {
                $text = $timeline [$n]->text;
                $tokens = parseTweet ( $text );
                $coord = extractLocation ( $timeline, $n );
                addSentimentURL ( $text, $tweet_id, $urls, $urls_id );
                $keywords = makeEntry ( $tokens, $tweet_id, $coord, $text );
                foreach ( $keywords as $type ) {
                    $json_index [] = $type;
                }
                $n ++;
                $no_of_tweets_in_batch ++;
            } else {
                $halt = true;
            }
        }
        if ($halt == false) {
            $tweet_id = $timeline [$n - 1]->id_str;

            $timeline = $connection->get ( 'statuses/user_timeline', array (
                    'screen_name' => $user_handle,
                    'count' => 200,
                    'max_id' => $tweet_id 
            ) );
            // skip 1st tweet after 1st batch
            $j = 1;
        }
        $count += $no_of_tweets_in_batch;
    }

    $json_index = extractSentiments ( $urls, $urls_id, $json_index );

    echo 'Number of tweets indexed: ' . ($count);
    return $json_index;
}

Extract sentiment method

function extractSentiments($urls, $urls_id, &$json_index) {
    $responses = multiHandle ( $urls );
    // add sentiments to all index entries
    foreach ( $json_index as $i => $term ) {
        $tweet_id = $term ['tweet_id'];
        foreach ( $urls_id as $j => $id ) {
            if ($tweet_id == $id) {
                $sentiment = parseSentiment ( $responses [$j] );
                $json_index [$i] ['sentiment'] = $sentiment;
            }
        }
    }
    return $json_index;
}

Method for handling multiple API calls

This is where the uClassify API calls are being processed at once:

function multiHandle($urls) {

    // curl handles
    $curls = array ();

    // results returned in xml
    $xml = array ();

    // init multi handle
    $mh = curl_multi_init ();

    foreach ( $urls as $i => $d ) {
        // init curl handle
        $curls [$i] = curl_init ();

        $url = (is_array ( $d ) && ! empty ( $d ['url'] )) ? $d ['url'] : $d;

        // set url to curl handle
        curl_setopt ( $curls [$i], CURLOPT_URL, $url );

        // on success, return actual result rather than true
        curl_setopt ( $curls [$i], CURLOPT_RETURNTRANSFER, 1 );

        // add curl handle to multi handle
        curl_multi_add_handle ( $mh, $curls [$i] );
    }

    // execute the handles
    $active = null;
    do {
        curl_multi_exec ( $mh, $active );
    } while ( $active > 0 );

    // get xml and flush handles
    foreach ( $curls as $i => $ch ) {
        $xml [$i] = curl_multi_getcontent ( $ch );
        curl_multi_remove_handle ( $mh, $ch );
    }

    // close multi handle
    curl_multi_close ( $mh );

    return $xml;
}
mesllo
  • You get one HTTP request coming into your XAMPP web server, i.e. a username, then in PHP you look up their tweets, then still in the same PHP script you loop through those tweets, making one call to an external web service (uClassify) for each? If so, it might be useful to post your PHP script (or, if it is long, the main loop - I'm especially interested in how you do the web service calls, and how you do them in parallel). – Darren Cook Feb 10 '15 at 09:20
  • I'm using cURL for calling uClassify. I don't process one call per tweet each time; instead I add a curl handle to a multi handle (still in cURL) and then, using curl_multi_exec(), it processes them in parallel, or so the libcurl documentation says. This makes handling all these tweets **much** faster, but it does not scale well to a large number of tweets. I have only successfully tested it with around 300-500 tweets; it does not work properly with 3200 tweets. I have edited the post with the code. – mesllo Feb 10 '15 at 17:31

1 Answer


The problem is giving curl too many URLs in one go. I am surprised you can manage 500 in parallel, as I've seen people complain of problems with even 200. This guy has some clever code that runs just 100 at a time, adding the next one each time one finishes, though I notice he later edited it down to do just 5 at a time.

I just noticed the author of that code released an open source library around this idea, so I think this is the solution for you: https://github.com/joshfraser/rolling-curl
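If you'd rather not pull in the library, here is a minimal sketch of the same rolling-window idea built directly on curl_multi, as a drop-in alternative to your multiHandle(). The function name multiHandleRolling, the window size of 20 and the 30-second timeout are my own choices for illustration, not part of your code or the library's API:

function multiHandleRolling($urls, $window = 20) {
    $results = array();
    $handles = array(); // original index => curl handle currently in flight
    $next = 0;          // index of the next URL to start
    $total = count($urls);
    $mh = curl_multi_init();

    // start the request for $urls[$i] and register it on the multi handle
    $add = function ($i) use ($urls, $mh, &$handles) {
        $d = $urls[$i];
        // accept either a plain URL string or an array with a 'url' key, like multiHandle() does
        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    };

    // prime the window with the first $window requests
    while ($next < $total && $next < $window) {
        $add($next++);
    }

    do {
        // run the transfers, then wait briefly for activity instead of busy-spinning
        curl_multi_exec($mh, $running);
        curl_multi_select($mh, 1.0);

        // harvest finished transfers and top the window back up
        while ($info = curl_multi_info_read($mh)) {
            $ch = $info['handle'];
            $i = array_search($ch, $handles, true);
            $results[$i] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            unset($handles[$i]);

            if ($next < $total) {
                $add($next++);
            }
        }
    } while ($running > 0 || !empty($handles));

    curl_multi_close($mh);
    // return results keyed and ordered the same way as the input URLs
    ksort($results);
    return $results;
}

With something like this in place, extractSentiments() could call multiHandleRolling($urls) instead of multiHandle($urls); the returned array is indexed the same way, so parseSentiment() and the rest of the code should not need to change. Treat 20 as a starting point and experiment: the right window size depends on how many concurrent requests uClassify and your machine tolerate.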

As for why you get a crash, a comment on this related question suggests the cause might be hitting the maximum number of OS file handles: What is the maximum number of cURL connections set by? Other suggestions there are simply heavy use of bandwidth, CPU and memory. (If you are on Windows, opening Task Manager should let you see if this is the case; on Linux use top.)

Darren Cook
  • Wow, this really helps, I will definitely look into the links! I will post here again once I've tried to work something out with this. Thank you! – mesllo Feb 10 '15 at 18:00