
I am wondering how I can figure out whether a page on a website provides a photo or a video by checking its URL. I investigated the website I am interested in and found that most of the links I have are in this form (I am not sure if I can actually name the website, so for now I have written it as an example):

http://www.example.com/abcdef

where example is the main domain and abcdef is a number like 69964. The interesting pattern I found is that after entering this URL, if the page actually has a video, the URL changes automatically to https://www.example.com/abcdef#mode=tour, while if it is just a photo, it changes to https://www.example.com/abcdef#mode=0

Now I have a list of URLs from this website, and for each one I just want to check whether it has a photo or a video, or whether it is not working (invalid URL). Is there any way to do that?

Ross_you
  • `ifelse(stringr::str_detect(url, "tour$"), "Is a video", "Is a picture")`? – Dunois Oct 29 '20 at 18:43
  • Yes, I guess this is correct, but I still have a problem. The main URL I have is in the form of `http://www.example.com/abcdef`; I have to open a browser and manually enter this URL, and only then does it change to either `https://www.example.com/abcdef#mode=tour` or `https://www.example.com/abcdef#mode=0`. So the remaining problem is how to find out whether the URL ends up in the form of `https://www.example.com/abcdef#mode=0` or `https://www.example.com/abcdef#mode=tour` – Ross_you Oct 29 '20 at 18:47
  • Could you perhaps provide an exemplar URL where one can actually test this? – Dunois Oct 29 '20 at 18:48
  • @Dunois, you can check `http://www.pixilink.com/69964` as an example. Once you enter this on your browser, it will change to `https://www.pixilink.com/69964#mode=tour` which shows it has video – Ross_you Oct 29 '20 at 18:51
  • Could you also provide an example of `#mode=0`? – Dunois Oct 29 '20 at 19:00
  • @Dunois, simply change it to another number. example `https://www.pixilink.com/93313` which will change to `https://www.pixilink.com/93313#mode=0` – Ross_you Oct 29 '20 at 19:08
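The one-liner suggested in the first comment only helps once the redirected URLs (with their `#mode=` fragment) are already at hand, for example copied from the browser's address bar. As a minimal sketch of that case in base R, the hypothetical helper `mode_from_fragment()` below (not part of the original thread) maps the fragment to a label and returns "unknown" when no recognised fragment is present:

```r
# Classify URLs that already carry the "#mode=" fragment
# (e.g. copied from the address bar after the page has loaded).
# mode_from_fragment() is a hypothetical helper name used for illustration.
mode_from_fragment <- function(urls) {
  # Does the URL end in "#mode=<something>"?
  has_mode <- grepl("#mode=[^#]*$", urls)
  # Extract the fragment value, or NA when there is none
  mode <- ifelse(has_mode, sub("^.*#mode=([^#]*)$", "\\1", urls), NA_character_)
  # Map known values to labels; anything else becomes "unknown"
  labels <- c(tour = "video", `0` = "picture")
  out <- unname(labels[mode])
  out[is.na(out)] <- "unknown"
  out
}

mode_from_fragment(c(
  "https://www.pixilink.com/69964#mode=tour",
  "https://www.pixilink.com/93313#mode=0",
  "http://www.pixilink.com/69964"
))
```

A URL without the fragment cannot be classified this way, which is exactly the situation described in the question: the fragment is added in the browser after the page loads, so it has to be derived from the page content instead.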

1 Answer


So I have a rather simple solution for this.

Inspecting the source of the URLs provided by the OP (e.g., https://www.pixilink.com/93313) indicates that the default #mode= value is set by the variable initial_mode in an embedded JavaScript snippet. So whether a URL will default to "picture" (#mode=0) or "video" (#mode=tour) can be established by checking the value assigned to this variable.

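#Function to get the value of initial_mode from the URL
urlmode <- function(x){
  #Download the page source, one element per line
  mycontent <- readLines(x)
  #Line number(s) where initial_mode is assigned
  mypos <- grep("initial_mode = ", mycontent)
  
  #No such line: report the URL as invalid instead of letting
  #if() fail on a zero-length condition
  if(length(mypos) == 0){
    cat("\n", x, "is an invalid URL. \n")
    return("invalid")
  }
  #Use the first match in case the pattern occurs on several lines
  mypos <- mypos[1]
  
  if(grepl("0", mycontent[mypos])){
    cat("\n", x, "has default initial_mode picture: #mode=0 \n")
    return("picture")
  } else if(grepl("tour", mycontent[mypos])){
    cat("\n", x, "has default initial_mode video: #mode=tour \n")
    return("video")
  } else{
    cat("\n", x, "is an invalid URL. \n")
    return("invalid")
  }
}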


#Example URLs to demonstrate functionality
myurl1 <- "https://www.pixilink.com/93313"
myurl2 <- "https://www.pixilink.com/69964"


urlmode(myurl1)
#
# https://www.pixilink.com/93313 has default initial_mode picture: #mode=0 
#[1] "picture"
#Warning message:
#In readLines(x) :
#  incomplete final line found on 'https://www.pixilink.com/93313'
#

urlmode(myurl2)
#
# https://www.pixilink.com/69964 has default initial_mode video: #mode=tour 
#[1] "video"
#Warning message:
#In readLines(x) :
#  incomplete final line found on 'https://www.pixilink.com/69964'

Needless to say, this is an extremely simplistic function that will (most likely) fail in all but the ideal (sub)set of cases. But it's a start.
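One way the function could be made a little less fragile is to extract the value assigned to `initial_mode` and compare it exactly, rather than searching the whole line for `0` or `tour`. The sketch below is only an illustration: the helper names `parse_initial_mode()` and `classify_mode()` are hypothetical, and the regular expression assumes the assignment looks like `initial_mode = "tour";` or `initial_mode = 0;`, which may not match the real page source exactly.

```r
# Extract the value assigned to initial_mode from the page source lines.
# ASSUMPTION: the assignment has the shape  initial_mode = "tour";  or
# initial_mode = 0;  (quotes optional). Adjust the pattern to the real page.
parse_initial_mode <- function(lines) {
  pattern <- "initial_mode[[:space:]]*=[[:space:]]*[\"']?([[:alnum:]_]+)"
  hit <- grep(pattern, lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # regexec() returns the full match followed by the captured group
  m <- regmatches(hit[1], regexec(pattern, hit[1]))[[1]]
  m[2]
}

# Map the extracted value to a label
classify_mode <- function(lines) {
  value <- parse_initial_mode(lines)
  if (is.na(value)) {
    "invalid"
  } else if (value == "tour") {
    "video"
  } else if (value == "0") {
    "picture"
  } else {
    "invalid"
  }
}

# Made-up page fragments to show the behaviour
classify_mode(c("<script>", "var initial_mode = \"tour\"; var zoom = 10;"))
classify_mode("var initial_mode = 0;")
classify_mode("<html>no such variable</html>")
```

Because the comparison is made on the captured value only, a stray `0` elsewhere on the same line (as in the first made-up example) does not turn a video into a picture. The function takes lines rather than a URL, so it can be combined with `readLines()` or any other way of fetching the page.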

Dunois
  • Wow, thanks. For my reference, can I ask in which part of this code, you are actually checking `initial_mode`? To my understanding, the difference comes from this variable; however, I do not see it in the code. Also, is there any way to improve the code in a way that it first checks if the URL is working or not? – Ross_you Oct 29 '20 at 19:39
  • So `readLines()` gets the URL's content, `grep()`ing for `initial_mode = ` finds the line number where `initial_mode` is located, and the `grepl()`s inside the `if/else` chain check whether it is a `0` (picture) or `tour` that's on that particular line. If you want to check if the URL works or not, you could have a `httr::GET()` or something of that sort wrapped inside a `tryCatch()` as the first line of the function (or have the function inside `tryCatch()`). – Dunois Oct 29 '20 at 19:44
  • Here: `mypos <- grep("initial_mode = ", mycontent)` and then here: `if(grepl("0", mycontent[mypos]))` (for instance). And yes, that snippet you've indicated will state that the URL is "invalid" if the `initial_mode = ` tests fail, but the function does not check whether the string input to the function is in and of itself a valid URL. (So it's not really robust in this sense, and the function will also fail with an input like `google.com`, for instance, because it doesn't actually check whether `readLines()` can handle the URL.) – Dunois Oct 29 '20 at 19:52
  • Can we only process a limited number of URLs in a day? The reason I am asking is that the code works perfectly when I combine it with `tryCatch()` on 100 sample URLs, but after I ran it on 4500 URLs, I got an error message that says: `Error in file(con, "r") : all connections are in use`, and now I am wondering whether I can only process a limited number of URLs in a day. Is that correct, or is there another problem here? – Ross_you Oct 30 '20 at 04:43
  • There's no limit to it, at least not from the user's side in this context. You probably just need to invoke `closeAllConnections()` after each function call, as it's likely that you're hitting the limit for how many connections can be open concurrently. I don't know what your code looks like, but if you have `tryCatch()` set up properly, I'd presume this is the only problem. – Dunois Oct 30 '20 at 09:35
  • Unfortunately, even when using `closeAllConnections()` I get this error: `Error in file(con, "r") : cannot open the connection`. I explained my code in detail here: https://stackoverflow.com/questions/64602664/how-to-resolve-error-in-filecon-r-all-connections-are-in-use and it would be great if you could comment on it – Ross_you Oct 30 '20 at 17:37
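The `tryCatch()` and connection-handling ideas discussed in these comments could be combined roughly as follows. This is a sketch under assumptions, not the code from the linked question: `read_page()`, `safe_urlmode()` and `check_urls()` are hypothetical names, the `reader` argument exists only so the logic can be tried without network access, and the `pause` between requests is an optional courtesy rather than something the thread establishes as necessary.

```r
# Read a page through an explicitly managed connection so that it is
# closed again even when reading fails.
read_page <- function(x) {
  con <- url(x)
  on.exit(close(con))
  readLines(con, warn = FALSE)
}

# Classify one URL; any error while reading is reported as "invalid".
# `reader` is a hypothetical hook that returns the page as a character vector.
safe_urlmode <- function(x, reader = read_page) {
  tryCatch({
    lines <- reader(x)
    hit <- grep("initial_mode = ", lines, value = TRUE)
    if (length(hit) == 0) {
      "invalid"
    } else if (grepl("tour", hit[1])) {
      "video"
    } else if (grepl("0", hit[1])) {
      "picture"
    } else {
      "invalid"
    }
  }, error = function(e) "invalid")
}

# Classify many URLs and collect the results in a data frame
check_urls <- function(urls, reader = read_page, pause = 0) {
  type <- vapply(urls, function(u) {
    Sys.sleep(pause)  # optional delay between requests
    safe_urlmode(u, reader)
  }, character(1), USE.NAMES = FALSE)
  data.frame(url = urls, type = type, stringsAsFactors = FALSE)
}
```

With network access, a call such as `check_urls(c("https://www.pixilink.com/93313", "https://www.pixilink.com/69964"), pause = 1)` would be the intended usage. Closing each connection with `on.exit()` avoids relying on `closeAllConnections()`, though whether this resolves the specific errors reported above depends on details of the code and the server that are not shown in this thread.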