0

I have a real-time url "linktoAPI" containing a list of nested dictionary. I have tried many solutions from the links below, but none of them helps me achieve what I want. Apparently, it is because the value of the nested dictionary (key: "history_value") is ununiform and contain long string object.

Extract dictionary value from column in data frame (got the error: 'get' does not applicable for string)

How to convert list of nested dictionary to pandas DataFrame? (got the error:'str' object is not a mapping)

Split / Explode a column of dictionaries into separate columns with pandas,

https://www.skytowner.com/explore/splitting_dictionary_into_separate_columns_in_pandas_dataframe

from urllib.request import urlopen
import pandas as pd
import json 

url = "LINKTOAPI"
response = urlopen(url)
data_json = json.loads(response.read())
baoluu = pd.DataFrame(data_json)
display(baoluu.head())

Run this code and you can see that the column "history_value" contain an ununiform dictionary. I want this columns to be separated into "ngaybaoluu", "ngayhoclai" and "lydo". Please help me. Thank you very much!!!

thanh pham
  • 61
  • 9
  • Hi, I've managed to get the "lydo" column, but couldn't find any "ngaybaoluu" or "ngayhoclai" fields in the response. Could you please describe how to determine these two fields from the url? – Brandon Jun 22 '22 at 04:24
  • @Brandon Hi, the "ngaybaoluu" and "ngayhoclai" only available in recent data. That means "created_at" > "2022-01-01" – thanh pham Jun 22 '22 at 04:46

1 Answers1

1

If I get it right, this should work:

import json
import requests
import pandas as pd

req = requests.get('https://office.ieltsvietop.vn/api/get_data/history')
req_json = req.json()

df = pd.DataFrame(json.loads(r['history_value']) for r in req_json)

this df should be like

     request_id ketoan_id lop_id  ... danhsachcho chinhanh_old chinhanh
0            11      2470    551  ...         NaN          NaN      NaN
1            13      2474    551  ...         NaN          NaN      NaN
2            12      2468    564  ...         NaN          NaN      NaN
3            15      2338    442  ...         NaN          NaN      NaN
4            31      2463    239  ...         NaN          NaN      NaN
...         ...       ...    ...  ...         ...          ...      ...
5256       4699      4357    NaN  ...         NaN          NaN      NaN
5257       4695      3787    NaN  ...         NaN          NaN      NaN
5258       4679      4716    NaN  ...         NaN          NaN      NaN
5259       4694      4114    596  ...         NaN          NaN      NaN
5260       4705      4839    601  ...         NaN          NaN      NaN

[5261 rows x 20 columns]

then we select the needed columns ngaybaoluu, ngayhoclai and lydo with

df = df[['ngaybaoluu', 'ngayhoclai', 'lydo']]

the final df is

      ngaybaoluu  ngayhoclai                                               lydo
0            NaN         NaN                   Bạn phù hợp với trình độ của lớp
1            NaN         NaN  Bạn cần lấy target để ra trường và phục vụ côn...
2            NaN         NaN              Học viên đăng kí học Speaking-express
3            NaN         NaN            Vt3 có lớp phù hợp với trình độ của bạn
4            NaN         NaN                                                NaN
...          ...         ...                                                ...
5256  22-06-2022  01-08-2022  Học viên tập trung ôn thi THPTQG. Học viên đã ...
5257  21-06-2022  21-08-2022  Học viên chưa sắp xếp được lịch học lại . Học ...
5258  21-06-2022  15-07-2022  Học viên đi  tập quân sự. Học viên đã hiểu rõ ...
5259         NaN  22-06-2022                                                NaN
5260         NaN         NaN                                                NaN

[5261 rows x 3 columns]

Be aware that many of the columns have null values in them, which means the original response of the url does not contain these fields, so it's fine. If you want to fill these null values, you can look up to .fillna().

Brandon
  • 708
  • 6
  • 13
  • 1
    The takeaway here is that the `history_value` field is a string field exported from a json object. So we need to load it as dict before doing any dataframe stuff – Brandon Jun 22 '22 at 05:06
  • Thank you @Bradon, it works like a charm. However, when I try to reproduce your code with another similar nested dictionary (sorry i can't share url link with you because it contains personal information) I receive an error "the JSON object must be str, bytes or bytearray, not NoneType". Apparently, I need to do some conversion before doing any dataframe stuff. Could you please help me on this case. Thank you!! – thanh pham Jun 22 '22 at 07:21
  • 1
    @thanhpham Happy to help! For `None` elements may present in the response, simply add an check to skip the line. e.g. `df = pd.DataFrame(json.loads(r.get('history_value')) for r in req_json if r.get('history_value'))`. So the lines without a `history_value` field or with an empty `history_value` field should be skipped. – Brandon Jun 22 '22 at 07:53
  • 1
    Aha, table joining is fairly straightforward using pandas. In the situation you mentioned above, it should be something like `df_merged = df.merge(history_value, on='ketoan_id', how='left')`. You can checkout more examples and detailed explanations at the official doc for [`DataFrame.merge`](https://pandas.pydata.org/pandas-docs/version/1.4.0/reference/api/pandas.merge.html). But that is another topic with regards to `merge` in pandas, I think you could post another question for specific help. – Brandon Jun 23 '22 at 04:02
  • @Brason I have posted a new question with the detail of my code and description of my problem here **https://stackoverflow.com/questions/72727407/missing-column-value-when-merge-tables-in-python**, I would be very thankful, if you could help. Thank you – thanh pham Jun 23 '22 at 09:03