
I am working on a community detection project with Twitter data, where I need to build a network based on user relationships. I have collected and filtered 200,000 UIDs, and my next step is to build the friend/follower network among them.

I am using Ruby scripts and the Twitter gem to collect, process, and store the data. To get around the API call limit I am routing requests through an Apigee proxy, so rate limiting is not an issue for now.

The call to get the relationship status between two UIDs is at: https://dev.twitter.com/docs/api/1/get/friendships/show

I need to speed up data collection. Currently I have many scripts running simultaneously in my terminal, which I find very hard to manage and scale. Is there a faster, more efficient, and more manageable way to do the same thing, or is there a completely different and better approach that I am missing?
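For reference, a single relationship check with the Twitter gem looks roughly like the sketch below. This is an illustration only: the credentials are placeholders, the Apigee proxy configuration is omitted, and the class and method names (Twitter::REST::Client, friendship) are from recent versions of the sferik twitter gem, so older releases may differ.

require 'twitter'

# Sketch only: placeholder credentials; the Apigee proxy endpoint is not shown.
client = Twitter::REST::Client.new do |config|
  config.consumer_key        = "CONSUMER_KEY"
  config.consumer_secret     = "CONSUMER_SECRET"
  config.access_token        = "ACCESS_TOKEN"
  config.access_token_secret = "ACCESS_TOKEN_SECRET"
end

# GET friendships/show for a pair of UIDs; returns a Twitter::Relationship
rel = client.friendship(12345, 67890)   # example UIDs
rel.source.following?                   # does 12345 follow 67890?
rel.source.followed_by?                 # does 67890 follow 12345?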

– asked by s2n
  • If the one answer doesn't help, consider adding information about why you think it is taking too long. If you have a bunch of scripts running simultaneously, either a job-control tool or a master script may be appropriate. Given the network-programming aspect of your project, I would have thought there would be Ruby gems to help with this; how deeply have you looked on that front? Scaling may mean you need to look at GNU parallel, Amazon Elastic Compute Cloud (EC2), or similar. Also, what about large-scale data-processing tools like Hadoop (which would almost certainly require custom coding in Java or another language)? Good luck. – shellter Feb 24 '12 at 22:18
  • And from looking at the dev.twitter link you've included, the JSON doc looks ripe for loading into MongoDB (this from a person who is on chapter 4 of MongoDB in Action, Manning Press, no affiliation). The book includes an example of retrieving data from Twitter directly into the DB, so it might be worth a look; a minimal storage sketch follows these comments. Good luck. – shellter Feb 24 '12 at 22:24
  • A job-control tool or master script is what I am looking at. Any suggestions for those? Also, would a change in programming language cause any significant increase in speed? – s2n Feb 25 '12 at 14:13
  • Why do you need to speed the process up instead of just waiting for it to take as long as it takes? – Taylor Singletary Feb 24 '12 at 17:26
  • If I let the process continue at the current speed, it will take far too long for my purpose, so that is not an option. – s2n Feb 25 '12 at 14:18
  • Kindly review the answers and select one. – Mike Q Aug 27 '19 at 17:25
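To illustrate the MongoDB suggestion in the comments above, a minimal storage sketch with the Ruby mongo gem might look like the following. Everything here is an assumption for illustration: the connection string, database, collection, and field names are placeholders, and the API shown is the modern 2.x driver.

require 'mongo'

# Sketch only: placeholder connection string and collection name.
mongo = Mongo::Client.new('mongodb://127.0.0.1:27017/twitter_network')
relationships = mongo[:relationships]

# `payload` stands in for the parsed JSON hash returned by friendships/show
payload = { 'relationship' => { 'source' => { 'id' => 12345, 'following' => true } } }
relationships.insert_one(source_id: 12345, target_id: 67890, raw: payload)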

4 Answers

One thing I can think of is to deploy the script on an EC2 instance; you could get the biggest instance and use it for a couple of hours. The benefit is that you have a more powerful machine and a faster internet connection.

Also, if you're just collecting public data, which means you don't have to authenticate via OAuth (please correct me if I'm wrong), I would use a Perl or Python script, which may be faster than Ruby with the gem.

– answered by toy

Why not use Logstash to collect the data? Logstash gives you plenty of outputs to send the data to so that you can easily filter through it, and you can filter the data within Logstash before sending it to an output. Available outputs include Elasticsearch (used to search, analyze, and visualize the data in real time), databases (MySQL, MSSQL, etc.), and much more.

Logstash - https://www.elastic.co/products/logstash

Twitter Logstash Plugin - https://www.elastic.co/guide/en/logstash/current/plugins-inputs-twitter.html

– answered by johnslippers

Use A Threading Wrapper Script

A threaded Bash or Python wrapper script may be all you need: a script that splits the work up and launches it for you automatically. The bonus is that you would not have to rewrite much to get it working. The hypothetical below could reduce runtime from roughly 111 hours to 1.1 hours.

Say your current solution is this:

file_of_200k_uids.txt
ruby ruby_script.rb "file_of_200k_uids.txt"

So ruby_script.rb runs through all 200K UIDs and performs the network task, which at, say, 2 seconds per UID equates to 400,000 seconds (roughly 111 hours).

Proposed solution (write a wrapper script with Bash 4+):

file_of_200k_uids.txt
ruby ruby_script.rb "file_of_200k_uids.txt"
bash_thread_manager.sh

Contents of bash_thread_manager.sh would be something like this:

#!/usr/bin/env bash
# -- Step one: have the bash script break the large UID file into chunks --
# -- and place the results in /path/to/folder --
mkdir -p /path/to/folder /output
# split by lines (2,000 UIDs per chunk -> 100 chunks for 200K UIDs),
# so no UID is ever cut in half
split -d -l 2000 file_of_200k_uids.txt /path/to/folder/uids_list

# -- Now run through the chunks and launch the script that does the work, --
# -- keeping at most a fixed number of instances (e.g. 100) running at once --
parent="$$"
max_children=100

for filename in /path/to/folder/uids_list*; do

    # wait until fewer than $max_children workers are running;
    # ps also lists the subshell running this pipeline, hence the "- 1"
    while (( $(ps --no-headers -o pid --ppid="$parent" | wc -l) - 1 >= max_children )); do
        sleep 5
    done

    # name each result file after its chunk so outputs never collide
    ruby ruby_script.rb "$filename" > "/output/result-$(basename "$filename").txt" &

done
wait

# -- Final step: combine all of the per-chunk result files --
cat /output/result-*.txt >> all.txt

The bash script manages reading the UID chunks from the folder and collecting the data in separate processes, up to a limit you define. In the example above we split file_of_200k_uids.txt into 100 chunks of 2,000 UIDs each and run up to 100 of these chunks at once. Any time the number of running workers drops below 100, the script launches the next chunk, so roughly 100 chunks are in flight at all times and the job finishes about 100x faster.

Further reading: splitting large text files into smaller files (https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/) and multithreading in Bash.
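To make the division of labour concrete, a hypothetical shape for ruby_script.rb under this scheme is sketched below. It is not the asker's actual script: it only shows the chunk-file input and STDOUT output that the wrapper above relies on, with the real Twitter lookups reduced to a placeholder.

# ruby_script.rb -- hypothetical sketch: process one chunk file of UIDs
uids = File.readlines(ARGV[0]).map(&:strip).reject(&:empty?)

uids.each do |uid|
  # placeholder for the real work (e.g. the friendships/show calls for this UID);
  # results go to STDOUT, which the wrapper redirects to /output/result-*.txt
  puts "#{uid}\tprocessed"
end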

– answered by Mike Q

You can try to use Nokogiri and parse the HTML page at https://twitter.com/#!/USERNAME/followers
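For what that might look like, here is a minimal sketch with open-uri and Nokogiri. Note the caveats: the followers page behind the #! URL is rendered with JavaScript, so a plain HTTP fetch may not contain the follower list at all, and the CSS selector below is a made-up placeholder rather than a real Twitter selector.

require 'open-uri'
require 'nokogiri'

# Sketch only: USERNAME and the CSS selector are placeholders.
# URI.open needs Ruby 2.5+; on older Rubies use open() from open-uri.
html = URI.open("https://twitter.com/USERNAME/followers").read
doc  = Nokogiri::HTML(html)

# hypothetical selector -- the real page is built client-side with JavaScript,
# so a static fetch is unlikely to return much here
doc.css('.follower-item .username').each do |node|
  puts node.text.strip
end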

– answered by Paulo Fidalgo