Organizing a large list of Twitter users based on whether they do or don't follow 2 very popular accounts

Question

I have a list of roughly 160,000 unique Twitter users, gathered using NodeXL. This data is currently in Excel format, but can easily be moved over to R.

I want to know whether each of these 160,000 Twitter users follows A) just @BernieSanders, B) just @DonaldTrump, C) both @BernieSanders and @DonaldTrump, or D) neither @BernieSanders nor @DonaldTrump.

I know of 2 basic ways to complete this task: 1) access all of Bernie Sanders' and Donald Trump's followers, and then cross reference those lists with my list of 160,000 Twitter users, OR 2) access which accounts all 160,000 Twitter users are following, and check for instances of @BernieSanders and/or @DonaldTrump.

The problem is, both of these methods are very computationally intensive, considering my sample size and the massive number of followers that each politician has.

Just to clarify: I do not currently have any data on who follows these politicians, or on who these 160,000 Twitter users are following.

How can I complete this task without frying my computer? Any/all suggestions/recommendations are welcome! Solutions that utilize R are especially welcome, since I am familiar with that language.

UPDATE: My data, at the current time, simply looks like this:

   User
brittbrittr32
drugsrebadmkay
alleyahhb
charles_preset
lilsaint___west
sarkassum
johnlockesknife
ohmsbeliver
wtvvrkay
hdyorker
ackmanscam
lacecierraa
_mikyy_
thevoyles
debrasmith37
craftyliberal
msftteee
julia_maries
coriana_hunt
me0w24
maria_lupinacci
bayrleu
rockythegrea9t9
wesfreedomlover
ronwilreagan
bombasticviwe
mimi38760907
pinkcloud15
andrew_whitebm
piperdewn
patsteinwand
tomjon12
solo_mariajose
nomineetrump
rghbfoxchase
marksoria
col_nj
cutnwood

So, it's just a long list of Twitter account names. No information on followership, whatsoever.
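
For a list like this, option 1 comes down to two set-membership tests once the follower lists are on disk. The following is only a minimal sketch under that assumption: the function name `classify_followers` and all of the account names are made up for illustration, and in practice the two follower vectors would have to be collected from the Twitter API first.

```r
# Hypothetical inputs: toy screen names standing in for the real lists
users             <- c("user_a", "user_b", "User_C", "user_d")
sanders_followers <- c("user_a", "user_c", "someone_else")
trump_followers   <- c("user_b", "user_c", "another_user")

classify_followers <- function(users, sanders_followers, trump_followers) {
  # Screen names are case-insensitive, so compare them in lower case
  u <- tolower(users)
  # %in% is vectorized and hash-based, so no explicit loop is needed
  follows_sanders <- u %in% tolower(sanders_followers)
  follows_trump   <- u %in% tolower(trump_followers)
  group <- ifelse(follows_sanders & follows_trump, "both",
           ifelse(follows_sanders, "sanders_only",
           ifelse(follows_trump, "trump_only", "neither")))
  data.frame(User = users, group = group, stringsAsFactors = FALSE)
}

result <- classify_followers(users, sanders_followers, trump_followers)
print(table(result$group))
```

The lookups themselves are cheap for 160,000 names, so the expensive part of option 1 is likely to be downloading the follower lists rather than the cross-referencing.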

160.000 isn't large at all. R barely gets its feet wet. As @docendodiscimus mentioned, let's see a sample of the data. — Roman Luštrik, Jun 10 '16 at 07:18
I think option 1 will be your best bet. However, you need to play a clever game to work around the Twitter API limits when getting Sanders' and Trump's followers — Altons, Jun 10 '16 at 12:43
@Altons I had a feeling that option 1 was going to be the most viable. Do you know of such a clever workaround? — waxattax, Jun 10 '16 at 17:23
@waxattax how are your skills in R or Python? Have a look at this link: http://stackoverflow.com/questions/17431807/get-all-follower-ids-in-twitter-by-tweepy — Altons, Jun 11 '16 at 10:52
@waxattax once you have them on your local drive you can use either R or Python (or any tool that handles large amounts of data, maybe the latest Excel?). Just make sure you pull the followers' names, not their internal ids, if names are the info you have. If you want to try the above solution, test it first on a user that has a small number of followers (i.e. 10 followers) — Altons, Jun 11 '16 at 10:55

Kuantew · Answer 1 · 2016-06-10T07:34:41.483

0

You can work this out in Excel and filter the content to make it suitable for R to work with.

However, you can also do this in R.

Paste all of the data into Excel, with follower names in the rows and the accounts they follow in the columns:

                                  **donald trump**       **bernie sanders**
             **first person**        follows                 NA
             **second person**       follows                 follows

Then create a function in R:

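    # mydata: path to the sheet saved as CSV (read.csv cannot read .xlsx)
    # numbof_followers: number of data rows in the file
    filterout <- function(mydata, numbof_followers, chunk = 1000) {
      # Take the column names from the header row
      header <- names(read.csv(mydata, nrows = 1))
      kept <- list()
      # Read the file `chunk` rows at a time instead of all at once
      for (start in seq(1, numbof_followers, by = chunk)) {
        n <- min(chunk, numbof_followers - start + 1)
        # skip = start jumps over the header plus the rows already read
        csv <- read.csv(mydata, skip = start, nrows = n,
                        header = FALSE, col.names = header)
        # Drop rows with an NA in any column, i.e. keep users who follow both
        kept[[length(kept) + 1]] <- csv[complete.cases(csv), ]
      }
      do.call(rbind, kept)
    }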

Then you can call filterout("excelfilename", numberoffollowers).

I could have developed the function further, for instance by having it find numberoffollowers itself or by adding other functionality, but I left that to you. I just wanted to give the basic idea, which is:

Use loops to filter the data without reading it all at once.
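
For a table small enough to read in one go, the same filter can be sketched without an explicit loop. This is only an illustration: the data frame `follows` and its rows are made up to match the layout of the table above, with NA meaning "does not follow".

```r
# Made-up rows in the layout of the table above
follows <- data.frame(
  user    = c("first person", "second person", "third person"),
  trump   = c("follows", "follows", NA),
  sanders = c(NA, "follows", "follows"),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE only for rows with no NA in any column,
# so this keeps the users who follow both accounts
both <- follows[complete.cases(follows), ]
print(both)
```

Because `complete.cases()` works on the whole data frame at once, it avoids growing a result row by row with `rbind()`.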

Good luck

edited Jun 10 '16 at 07:34

answered Jun 10 '16 at 07:12

Kuantew


Still some room for improvement, tbh. Why, for example, do you use `while(TRUE)`? – talat Jun 10 '16 at 07:21
I seriously don't know, I just wanted to show how it could be done. You can use a for loop, a repeat loop, anything. In this case the main purpose is filtering out with looping :) – Kuantew Jun 10 '16 at 07:22
Hmm, it's certainly possible by "traditional" looping but in R we often try to find solutions that avoid explicit looping where possible, i.e. we look for vectorized approaches that are often a lot faster (and potentially cleaner code) than loops – talat Jun 10 '16 at 07:25
Hmm, I see. I'm not an R programmer, so I just tried the traditional method in this case. Thanks for your help. – Kuantew Jun 10 '16 at 07:30
