2

I have a column "place" in my table which contains data about a place that looks like:

{ "id" : "94965b2c45386f87", "name" : "New York", "boundingBoxCoordinates" : [ [ { "longitude" : -79.76259, "latitude" : 40.477383 }, { "longitude" : -79.76259, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 40.477383 } ] ], "countryCode" : "US", "fullName" : "New York, USA", "boundingBoxType" : "Polygon", "URL" : "https://api.twitter.com/1.1/geo/id/94965b2c45386f87.json", "accessLevel" : 0, "placeType" : "admin", "country" : "United States" }

From this, I want to extract the country name. I have tried the following code:

loc <- t1$place
loc = gsub('"', '', loc)
loc = gsub(',', '', loc)

to clean up the string and now it looks like this:

"{ id : 00ed6f0947c230f4 name : Caloocan City boundingBoxCoordinates : [ [ { longitude : 120.9607709 latitude : 14.6344661 } { longitude : 120.9607709 latitude : 14.7873208 } { longitude : 121.1015117 latitude : 14.7873208 } { longitude : 121.1015117 latitude : 14.6344661 } ] ] countryCode : PH fullName : Caloocan City National Capital Region boundingBoxType : Polygon URL : https://api.twitter.com/1.1/geo/id/00ed6f0947c230f4.json accessLevel : 0 placeType : city country : Republika ng Pilipinas }"

Now to extract the country name, I want to use the word() function:

word(loc, n, sep=fixed(" : "))

where n in the position of the country name I still did not count. But this function gives the correct output when n=1 but gives an error for any other vaue of n:

Error in word[loc, "start"] : subscript out of bounds

Why is that happening? The loc variable certainly has more words with that separation. Or can someone suggest a better way of extracting the country name from that field?

EDIT: t1 is the dataframe that consists my entire table. Presently I am interested only in the place field of my table which has the information in the above mentioned format. Hence I am trying to load the place field into a separate variable called "loc" using the basic assignment instruction:

loc <- t1$place

In order to read it as a JSON, the place field needs to be delimited by single quotes which it is not originally. I have 2 millions rows in my table so I really can't manually add the delimiters.

Sandy Muspratt
  • 31,719
  • 12
  • 116
  • 122
kpks
  • 59
  • 1
  • 7

1 Answers1

4

This looks like a JSON object so it would be easier to use a JSON parse to extract the data.

So if this your string value

x <- '{ "id" : "94965b2c45386f87", "name" : "New York", "boundingBoxCoordinates" : [ [ { "longitude" : -79.76259, "latitude" : 40.477383 }, { "longitude" : -79.76259, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 40.477383 } ] ], "countryCode" : "US", "fullName" : "New York, USA", "boundingBoxType" : "Polygon", "URL" : "https://api.twitter.com/1.1/geo/id/94965b2c45386f87.json", "accessLevel" : 0, "placeType" : "admin", "country" : "United States" }'

then you can do

library(jsonlite)
# or library(RJSOINIO)
# or library(rjson)

fromJSON(x)$country
# [1] "United States"
MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • Thanks for the prompt reply. Please forgive my ignorance but the field place does not have single quotes in the start and end of the string. So it is showing an error when I try something like: x <- 't1$place' or loc <- fromJSON('t1$place') I definitely can not insert single quotes manually since I have about 2 million records to change. Can you suggest something for this?? – kpks May 15 '15 at 21:33
  • I would assume `fromJSON(t1$place)` would work. You didn't really supply enough details to make the data [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) (see that link for tips). – MrFlick May 15 '15 at 21:36
  • `code`loc <- fromJSON(t1$place) gives an error: Error in fromJSON(t1$place) : STRING_ELT() can only be applied to a 'character vector', not a 'integer' – kpks May 15 '15 at 21:40
  • How about `fromJSON(as.character(t1$place))` (again, it would be easier if you provided a reproducible example since at this point i can only guess based on the error message). – MrFlick May 15 '15 at 21:42
  • Wow, this worked but only the first row is being parsed. Should I write a loop for all the rows of the table? Also, I edited the question. Please let me know if that suffices. Sorry about that incomplete information. – kpks May 15 '15 at 21:48