
I'm super new at this and working in R for my thesis. The code in this answer (Extracting data from an API using R) finally worked for me, but I can't figure out how to add a loop to it: I keep getting the first page from the API when I need all 3360 pages. Here's the code:

    library(httr)
    library(jsonlite)

    r1 <- GET("http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s#soktraff")

    r2 <- rawToChar(r1$content)
    class(r2)

    r3 <- fromJSON(r2)
    r4 <- r3$dokumentlista$dokument

By the time I reach r4, it's already a data frame.

Please and thank you!

Edit: originally, I couldn't get a URL that included the page number. Now I have it (below), but I still haven't been able to loop over it. "http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p="

Larissa
  • Do you know what needs to change in your code to, say, get the second page? Is there some part of the URL that needs to be modified? – Gregor Thomas Feb 07 '19 at 14:41
  • I've looked in the URL and the page number isn't there. It only shows up inside the response, after I parse it into r3. By the way, since this is Swedish data, the word for "page" is "sida" and for "pages", "sidor". I tried adding the loop before and after it and it doesn't work. – Larissa Feb 07 '19 at 14:47
  • Thanks, that helps. The title and text made it seem like maybe you knew how to get the next page, just didn't know how to code it up in a loop. This comment makes it clearer. Perhaps edit your question to include that information and your learnings from the one answer so far. – Gregor Thomas Feb 07 '19 at 16:39
  • Yeah, sorry. I've figured that out though, through the answer by DS_UNI. – Larissa Feb 07 '19 at 16:46
  • Right, so please edit that information into your question. If someone new comes to try to answer, show them that information in the question so they don't have to run your code, then read these comments and the existing answer and put it all together themselves. – Gregor Thomas Feb 07 '19 at 16:53

1 Answer


I think you can extract the url of the next page from r3 as follows:

    next_url <- r3$dokumentlista$`@nasta_sida`
    # you'll need to re-check this: sometimes the url comes back with white
    # spaces in it; you may not face this problem, but this line removes them
    next_url <- gsub(" ", "", next_url)

    GET(next_url)
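
If you'd rather not hard-code page numbers at all, here's a minimal sketch that keeps following those next-page links until the API stops providing one (I'm assuming `@nasta_sida` is missing or empty on the last page, which you'd want to verify):

    # keep requesting @nasta_sida until the API stops providing one
    # (assumption: the field is missing or empty on the last page)
    pages <- list()
    url <- "http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s"

    while (length(url) == 1 && nzchar(url)) {
      res <- fromJSON(rawToChar(GET(url)$content))
      pages[[length(pages) + 1]] <- res$dokumentlista$dokument
      # strip the stray whitespace mentioned above before reusing the url
      url <- gsub(" ", "", res$dokumentlista$`@nasta_sida`)
    }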

Update:

I tried the URL with the page number for the first 10 pages and it worked:

    my_dfs <- lapply(1:10, function(i){
      # build the url for page i, request it, and parse out the documents
      my_url <- paste0("http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p=", i)
      r1 <- GET(my_url)
      r2 <- rawToChar(r1$content)
      r3 <- fromJSON(r2)
      r4 <- r3$dokumentlista$dokument
      return(r4)
    })
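
If you want every page rather than a hard-coded `1:10`, one option is to read the total page count from the first response and loop over that. A sketch, with the caveat that `@sidor` ("pages" in Swedish, as the comments note) is my guess at the field name; inspect `names(first$dokumentlista)` to confirm it:

    base_url <- "http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p="

    first <- fromJSON(rawToChar(GET(paste0(base_url, 1))$content))
    # "@sidor" is an assumption: check names(first$dokumentlista)
    n_pages <- as.numeric(first$dokumentlista$`@sidor`)

    my_dfs <- lapply(seq_len(n_pages), function(i){
      Sys.sleep(0.1)  # small pause between requests, out of politeness
      r <- fromJSON(rawToChar(GET(paste0(base_url, i))$content))
      r$dokumentlista$dokument
    })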

Update 2:

The extracted data frames are complex (e.g. some columns are lists of data frames), which is why a simple rbind will not work here. You'll have to do some pre-processing before you stack the data together; something like this would work:

    library(magrittr)  # provides %>% (dplyr re-exports it too)

    my_dfs %>%
      lapply(function(df_0){
        # do some stuff here with the data, and choose the variables you need;
        # I chose the first 10 columns to check that I got 200 different observations
        df_0[1:10]
      }) %>%
      do.call(rbind, .)
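
As a quick sanity check with the 10 pages above, the stacked result should have 200 rows (10 pages times 20 documents per page, as discussed in the comments below):

    combined <- do.call(rbind, lapply(my_dfs, function(df_0) df_0[1:10]))
    dim(combined)  # expect 200 rows (10 pages x 20 documents) and 10 columns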
DS_UNI
  • Thank you! It didn't quite work, but I got the url with a page number ([1] "http://data.riksdagen.se/dokumentlista/?u17=22&sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p=2"). Maybe I can loop it now. [UPDATE]: Nope, I still can't do it. – Larissa Feb 07 '19 at 16:12
  • why? did you get an error? please see my updated answer – DS_UNI Feb 07 '19 at 20:38
  • No, it's not really an error. It just comes up with 20 observations, no matter what I try, including your updated answer. (And I know that for the whole timeframe, there should be over 67 thousand observations.) – Larissa Feb 07 '19 at 20:46
  • aha ok! I didn't check the extracted data – DS_UNI Feb 07 '19 at 20:51
  • I just checked, and I got 57 variables and 20 observations per page, sounds right to you? (Sorry, the first comment was wrong; I corrected it.) – DS_UNI Feb 07 '19 at 20:55
  • Yeah, that's all that ever comes up, no matter how I add the loop: 57 variables, 20 obs. I need all 67 thousand obs, which is why I need a loop that can grab at least 1000 obs at once (maybe 500). – Larissa Feb 07 '19 at 21:33
  • I think you understood me wrong, note that I'm getting 20 obs. **per page**, so in total (from 10 pages) I get 200. In the code `my_dfs` is a list of 10 data frames each is extracted from one page. – DS_UNI Feb 08 '19 at 08:52
  • Got it! I can't thank you enough and trust me, you will be on my thesis acknowledgments. :) – Larissa Feb 08 '19 at 11:55
  • Glad I could help! I, as well as others, would really appreciate it if you could accept my answer, that way the answer might help others in the future. And good luck with your work :) – DS_UNI Feb 08 '19 at 12:07
  • Oh, of course! I didn't know that was a thing! Seriously thank you so much, I'll be able to get a lot done with this. – Larissa Feb 08 '19 at 12:50
  • Hi, @DS_UNI. I was wondering if I could trouble you again. There's one piece of information that I haven't been able to get no matter what I do. I even made it into a question if you wanna answer there (https://stackoverflow.com/questions/54639133/a-follow-up-to-extracting-data-from-an-api-using-r) But it seems that it's a list of data.frames (called "intressent") and it comes up as either NULL or error. I promise, last thing. Thank you! – Larissa Feb 12 '19 at 11:40