I am passing an ID value to an API URL to get a JSON response, but only the first request succeeds and all the rest return HTTP 500 errors. I collect the IDs in a list and pass each one to the API URL as a query parameter inside a while loop to extract the data.

# Get the IDs into a variable
df_filter=spark.sql("""select distinct ID from filter_view""")

rdd = df_filter.rdd
listOfRows = rdd.collect()
counter = 0
##total_results = []
while counter < len(listOfRows):
    url += '?ids=' + listOfRows[counter].ID 
    response = requests.get(url,headers=headers)
       
    if response.status_code == 200:
        json_response = response.json()
        ##total_results.append(json_response)
        df2 = pd.json_normalize(json_response, record_path=['entities'])
        display(df2)
        
    else:
        print("Error: HTTP status code " + str(response.status_code))
    counter +=1

Only the first ID returns data; every other request ends with a 500 error.

Desired output:

ID   ItemID   Details
1    100      text
1    101      text
2    200      text
2    300      text
3    400      sometext
3    500      sometext
   

Output I am getting:

ID   ItemID   Details
1    100      text
1    101      text
Error: HTTP status code 500
Error: HTTP status code 500
Error: HTTP status code 500
Error: HTTP status code 500
Error: HTTP status code 500
Error: HTTP status code 500
Michael Ruth
Arun.K
  • Do the failing responses have anything in the response body that provides more information about the error? Otherwise, you're out of luck unless you have access to the remote host's logs: HTTP status 500 is an internal server error. – Michael Ruth May 02 '23 at 17:16
  • Oh, wait, you're concatenating `'?ids=...'` each iteration. Don't do this. Try `response = requests.get(url + '?ids=' + listOfRows[counter].ID, headers=headers)`. – Michael Ruth May 02 '23 at 17:19
  • It would also be nice to have `url` defined in the question, in the interest of a [mre] – Michael Ruth May 02 '23 at 17:47
  • @MichaelRuth - I cannot provide the actual URL because it is vendor-licensed. – Arun.K May 03 '23 at 03:42

1 Answer

The first iteration produces a valid URL, baseURL/?ids=1. But because url is extended in place with += (concatenation and assignment), the second iteration produces baseURL/?ids=1?ids=2 when you want baseURL/?ids=2, and every later request is malformed in the same way.

while counter < len(listOfRows):
    # Build the request URL from the unchanged base instead of appending to url
    response = requests.get(f'{url}?ids={listOfRows[counter].ID}', headers=headers)
    counter += 1
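
To see the difference without touching the network, the two patterns can be compared as plain strings. This is a minimal sketch: the base URL and the IDs are made up for illustration and are not the vendor's real endpoint.

```python
# Hypothetical base URL and IDs, used only to show how the strings evolve.
base_url = 'https://api.example.com/entries'
ids = ['1', '2', '3']

# Buggy pattern: += keeps appending to the same string on every pass.
url = base_url
accumulated = []
for id_ in ids:
    url += '?ids=' + id_
    accumulated.append(url)

# Fixed pattern: derive each request URL from the unchanged base.
fresh = [f'{base_url}?ids={id_}' for id_ in ids]

print(accumulated[-1])  # https://api.example.com/entries?ids=1?ids=2?ids=3
print(fresh[-1])        # https://api.example.com/entries?ids=3
```

Only the first accumulated URL is well formed, which matches the symptom of one good response followed by errors.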

Does the API support GETting multiple resources in a single request? Typically, with a plural query parameter like ids, it will take either a comma-separated list of resource IDs (?ids=1,2,3) or an array (?ids[]=1&ids[]=2&ids[]=3, or ?ids=1&ids=2&ids=3). If so, it will be way more efficient, and more polite to the API provider, to make one such request.

response = requests.get(
    url + '?ids=' + ','.join([row.ID for row in listOfRows]),
    headers=headers
)

You'll probably need to change the code to parse the new response.
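
As a rough illustration of the comma-separated and repeated-key styles, the standard library's urllib.parse.urlencode can build both query strings. The IDs here are placeholders, and which style the API accepts (if any) is something only its documentation or vendor can confirm.

```python
from urllib.parse import urlencode

ids = ['1', '2', '3']  # hypothetical IDs

# Comma-separated form; safe=',' keeps the commas from being percent-encoded.
comma_query = urlencode({'ids': ','.join(ids)}, safe=',')

# Repeated-key form; doseq=True expands the list into one key=value pair per element.
repeated_query = urlencode({'ids': ids}, doseq=True)

print(comma_query)     # ids=1,2,3
print(repeated_query)  # ids=1&ids=2&ids=3
```

requests can also build the repeated-key form itself when given a list, e.g. requests.get(url, params={'ids': ids}, headers=headers), which avoids manual string concatenation.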

If multiple GET isn't supported, at least convert this to a for-loop. There's no need to keep track of counter and test counter < len(listOfRows), and it will improve readability.

df_filter=spark.sql("""select distinct ID from filter_view""")

rdd = df_filter.rdd
listOfRows = rdd.collect()
for row in listOfRows:
    response = requests.get(f'{url}?ids={row.ID}', headers=headers)
       
    if response.status_code == 200:
        json_response = response.json()
        df2 = pd.json_normalize(json_response, record_path=['entities'])
        display(df2)
        
    else:
        print("Error: HTTP status code " + str(response.status_code))
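
If the goal is one combined table rather than one display per ID, the rows can be accumulated across iterations, and the failing IDs recorded next to their status codes. The sketch below is only an outline under assumptions: fetch is a hypothetical callable standing in for the requests.get call, and the payload shape (an 'entities' list of dicts) is inferred from the record_path used above.

```python
def collect_entities(ids, fetch):
    """Gather entity rows for every ID and record the IDs whose request failed.

    fetch is a hypothetical callable: fetch(id_) -> (status_code, payload).
    """
    rows, failed = [], []
    for id_ in ids:
        status, payload = fetch(id_)
        if status == 200:
            # Tag each entity with the ID it was requested for.
            rows.extend({'ID': id_, **entity} for entity in payload['entities'])
        else:
            failed.append((id_, status))
    return rows, failed


# Stand-in for the real API call, so the sketch runs offline.
def fake_fetch(id_):
    if id_ == '2':
        return 500, None
    return 200, {'entities': [{'ItemID': int(id_) * 100, 'Details': 'text'}]}


rows, failed = collect_entities(['1', '2', '3'], fake_fetch)
print(rows)
print(failed)  # [('2', 500)]
```

Passing rows to pd.DataFrame(rows) would then give a single DataFrame, and failed shows exactly which IDs need a closer look.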

Update, based on this comment:

i have over 5000 ids that needs to be passed one by one. How can this be passed in a chunks of 20 each may be?

Build URLs of the form ...?ids=1&ids=2&ids=3... with no more than 20 ids per URL.

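from itertools import islice
from typing import Iterable

def chunker(it: Iterable, chunksize: int):
    # Yield successive lists of at most chunksize items from it
    iterator = iter(it)
    while chunk := list(islice(iterator, chunksize)):
        yield chunk

for id_chunk in chunker([row.ID for row in listOfRows], 20):
    # Produces ...?ids=a&ids=b&... with up to 20 IDs per request
    response = requests.get(
        f'{url}?ids=' + '&ids='.join(id_chunk),
        headers=headers
    )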

The chunker() will split an iterable into lists with length <= chunksize. First filter listOfRows for just the IDs. Then chunk the IDs into lists of length 20. Build the URL and make the request. Thank you kafran for chunker().
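
A small offline run may make the chunking behaviour easier to check. In this sketch the base URL and the seven IDs are invented, and the chunk size is 3 instead of 20 so the output stays short; chunker itself is repeated so the example runs on its own (it needs Python 3.8+ for the := operator).

```python
from itertools import islice


def chunker(it, chunksize):
    # Yield successive lists of at most chunksize items from it.
    iterator = iter(it)
    while chunk := list(islice(iterator, chunksize)):
        yield chunk


base_url = 'https://api.example.com/entries'  # hypothetical endpoint
ids = [str(n) for n in range(1, 8)]           # seven made-up string IDs

urls = [f'{base_url}?ids=' + '&ids='.join(chunk) for chunk in chunker(ids, 3)]
for u in urls:
    print(u)
# https://api.example.com/entries?ids=1&ids=2&ids=3
# https://api.example.com/entries?ids=4&ids=5&ids=6
# https://api.example.com/entries?ids=7
```

The last chunk is simply shorter when the number of IDs is not a multiple of the chunk size. With 5000 IDs and a chunk size of 20, this comes to 250 requests. Note that '&ids='.join(...) requires the IDs to be strings.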

Michael Ruth
  • Thank you @Michael Ruth, I will try it and update here. – Arun.K May 03 '23 at 03:01
  • The URL does not accept multiple IDs; it throws error code 414. So I resorted to the for loop as you suggested. It returns data for some time but later throws Error: HTTP status code 401. I tried the while loop too, with the same result: after a few minutes of returning data it throws Error: HTTP status code 401 – Arun.K May 03 '23 at 03:41
  • @Arun.K, 401 status means unauthorized, the requestor hasn't supplied valid authentication credentials for the resource. If `listOfRows` is the same each time, I bet it fails on the same resource. Print the ID along with the status code to verify. – Michael Ruth May 03 '23 at 03:57
  • I checked with the vendor and they say the API supports multiple IDs in one GET request, but it should be in this format ==> entries?Ids=&Ids=&Ids=…. and so on. Also, I have over 5000 IDs that need to be passed. How can they be passed in chunks of, say, 20 each? – Arun.K May 04 '23 at 02:15
  • Based on your suggestion above, this is what I think would work: ?ids=1&ids=2&ids=3. How can it be passed in the URL in a for loop? – Arun.K May 04 '23 at 03:16
  • Thanks Michael for the chunker part, I was not aware it could be done like that. I got that part working with your code and passed over 2800 IDs in 10 minutes, but hit a snag after 10 minutes because the API token expires. I am currently working on passing the token and refresh token in the same for loop. If you have any links or pointers on passing tokens and refresh tokens in a loop, please let me know. – Arun.K May 06 '23 at 04:03
  • The simplest way to handle token refresh is to check the HTTP status and/or body of each response. If they indicate that the token has expired, refresh the token, repeat the failed request, and continue. – Michael Ruth May 08 '23 at 15:08
  • It's best to write a new question regarding the token issue since it's a different problem than what's presented in this question. – Michael Ruth May 08 '23 at 15:14
  • I actually fixed the issue by generating the token in a different notebook and passing it to the main notebook in the for loop. The only caveat is that a token is generated for every 20 IDs passed; it is time-consuming, but it works. – Arun.K May 10 '23 at 02:59
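
The refresh-and-retry idea from the comments above could look roughly like the following. It is a sketch under assumptions, not the vendor's authentication flow: send and refresh_token are hypothetical callables wrapping the real request and the real token endpoint, and a 401 status is assumed to be the signal that the token has expired.

```python
def get_with_refresh(send, refresh_token, token, max_retries=1):
    """Call send(token); on HTTP 401, refresh the token and retry.

    send and refresh_token are hypothetical callables:
      send(token) -> (status_code, body)
      refresh_token() -> new token string
    Returns (status_code, body, token) so the caller can reuse the token.
    """
    status, body = send(token)
    retries = 0
    while status == 401 and retries < max_retries:
        token = refresh_token()
        status, body = send(token)
        retries += 1
    return status, body, token


# Offline stand-in: only the token 'fresh' is accepted.
def fake_send(token):
    return (200, 'ok') if token == 'fresh' else (401, 'expired')


status, body, token = get_with_refresh(fake_send, lambda: 'fresh', 'stale')
print(status, body, token)  # 200 ok fresh
```

Returning the token lets the loop over ID chunks carry it forward, so a new token is requested only after a 401 rather than once per chunk.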