1

I want to identify the tweets in my Twitter data set that contain a URL, for example by looking for the string "http://".

How can I do this in R? For example, the tweet texts are:

  "@RainxDog @twitpic Please HELP #OccupyWallStreet and RT this video: http://t.co/vjwNR7TC"

  "@degamuna Please HELP #OccupyWallStreet and RT this video: http://t.co/vjwNR7TC"
Frank Wang

3 Answers

3

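You can use grep, or grepl to get a logical index:

# data is assumed to be a character vector of tweet texts.
# fixed = TRUE matches "http://" literally. No length check is needed:
# if nothing matches, the result is simply character(0).
data[grepl("http://", data, fixed = TRUE)]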
shhhhimhuntingrabbits
3

Your relatively simple question hides something that is actually quite tricky. In your two examples, the URLs:

  1. were of the form http://t.co/. What about bit.ly links? What about https?
  2. appeared at the end of the tweet. What about URLs in the middle or at the start of the tweet?

Construct a set of sample tweets and make sure that your regular expression works.
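
As a rough sketch of such a test set in R: the tweets below are made up for illustration, and the pattern is only an assumption (it treats anything starting with http:// or https:// up to the next whitespace as a URL), so adjust both to your data.

```r
# Hypothetical sample tweets covering the cases listed above.
tweets <- c(
  "Please RT this video: http://t.co/vjwNR7TC",  # http, URL at the end
  "https://bit.ly/abc123 is worth a read",        # https, URL at the start
  "See http://example.com/page for details",      # URL in the middle
  "No link here, just #OccupyWallStreet"          # no URL
)

# Assumed pattern: "http://" or "https://" followed by non-whitespace characters.
url_pattern <- "https?://[^[:space:]]+"

# TRUE for each tweet that contains at least one URL.
has_url <- grepl(url_pattern, tweets)

# The matched URLs themselves: a list with one character vector per tweet.
urls <- regmatches(tweets, gregexpr(url_pattern, tweets))

# Keep only the tweets that contain a URL.
tweets[has_url]
```

A pattern this loose will also pick up trailing punctuation (for example a full stop directly after a link), which is one more case worth adding to the sample set.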

Basically, you need a regular expression. Stack Overflow questions to look at are:

  1. How to extract a URL from a Tweet with a JavaScript RegEx?
  2. What's the cleanest way to extract URLs from a string using Python?

These questions also contain links.

csgillespie
0

You can get all the URLs of a tweet using Twitter Entities. When you make the REST call, make sure you include

&include_entities=true

This will give you a section in the JSON or XML called entities. There will be a child node called urls.

Here's an example of what will be returned.

"text": "Twitter for Mac is now easier and faster, and you can open multiple windows at once http://t.co/0JG5Mcq",
    "entities": {
      "media": [
      ],
      "urls": [
        {
          "url": "http://t.co/0JG5Mcq",
          "display_url": "blog.twitter.com/2011/05/twitte…",
          "expanded_url": "http://blog.twitter.com/2011/05/twitter-for-mac-update.html",
          "indices": [
            84,
            103
          ]
        }
      ],
      "user_mentions": [
      ],
      "hashtags": [
      ]
    }

So, look for entities -> urls to see if a tweet contains a link to an external site.
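
Since the question is about R, here is a minimal sketch of that check. It assumes the jsonlite package is installed and that you already have the response for one tweet as a JSON string; the string below is a trimmed copy of the example above, wrapped in braces so it parses as a complete object.

```r
library(jsonlite)  # assumption: jsonlite is installed; it is not part of base R

# Trimmed copy of the example response, wrapped in { } to form a valid JSON object.
json <- '{
  "text": "Twitter for Mac is now easier and faster, and you can open multiple windows at once http://t.co/0JG5Mcq",
  "entities": {
    "media": [],
    "urls": [
      {
        "url": "http://t.co/0JG5Mcq",
        "expanded_url": "http://blog.twitter.com/2011/05/twitter-for-mac-update.html",
        "indices": [84, 103]
      }
    ],
    "user_mentions": [],
    "hashtags": []
  }
}'

# simplifyVector = FALSE keeps JSON arrays as plain R lists,
# so an empty "urls" array becomes an empty list.
tweet <- fromJSON(json, simplifyVector = FALSE)

# The tweet contains a link if entities -> urls has at least one entry.
has_url <- length(tweet$entities$urls) > 0

# Collect the expanded URLs (character(0) when there are none).
expanded <- vapply(tweet$entities$urls, function(u) u$expanded_url, character(1))
```

For a whole data set you would apply the same check to each parsed tweet, for example with vapply over a list of tweets.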

Terence Eden