In Python, how to parse and organize information from an API with different structures on each link?

Question

In Python 3, I use requests and pandas to capture information from a public API, like this:

import requests
import pandas as pd

headers = {"Accept" : "application/json"}

#Example link
url = 'http://legis.senado.leg.br/dadosabertos/materia/votacoes/137178'

projetos_vot = []

try:
    r = requests.get(url, headers=headers)
except requests.exceptions.HTTPError as errh:
    print ("Http Error:",errh)
except requests.exceptions.ConnectionError as errc:
    print ("Error Connecting:",errc) 
except requests.exceptions.Timeout as errt:
    print ("Timeout Error:",errt)
except requests.exceptions.RequestException as err:
    print ("OOps: Something Else",err)

projects = r.json()

try:
    CodigoMateria = str(projects['VotacaoMateria']['Materia']['IdentificacaoMateria']['CodigoMateria'])
except KeyError:
    CodigoMateria = None                
except TypeError:
    CodigoMateria = None

try:
    SiglaCasaIdentificacaoMateria = str(projects['VotacaoMateria']['Materia']['IdentificacaoMateria']['SiglaCasaIdentificacaoMateria'])
except KeyError:
    SiglaCasaIdentificacaoMateria = None                
except TypeError:
    SiglaCasaIdentificacaoMateria = None

try:
    NomeCasaIdentificacaoMateria = str(projects['VotacaoMateria']['Materia']['IdentificacaoMateria']['NomeCasaIdentificacaoMateria'])
except KeyError:
    NomeCasaIdentificacaoMateria = None                
except TypeError:
    NomeCasaIdentificacaoMateria = None

try:
    SiglaSubtipoMateria = str(projects['VotacaoMateria']['Materia']['IdentificacaoMateria']['SiglaSubtipoMateria'])
except KeyError:
    SiglaSubtipoMateria = None                
except TypeError:
    SiglaSubtipoMateria = None

try:
    DescricaoSubtipoMateria = str(projects['VotacaoMateria']['Materia']['IdentificacaoMateria']['DescricaoSubtipoMateria'])
except KeyError:
    DescricaoSubtipoMateria = None                
except TypeError:
    DescricaoSubtipoMateria = None

try:
    NumeroMateria = str(projects['VotacaoMateria']['Materia']['IdentificacaoMateria']['NumeroMateria'])
except KeyError:
    NumeroMateria = None                
except TypeError:
    NumeroMateria = None

try:
    AnoMateria = str(projects['VotacaoMateria']['Materia']['IdentificacaoMateria']['AnoMateria'])
except KeyError:
    AnoMateria = None                
except TypeError:
    AnoMateria = None

try:
    DescricaoObjetivoProcesso = str(projects['VotacaoMateria']['Materia']['IdentificacaoMateria']['DescricaoObjetivoProcesso'])
except KeyError:
    DescricaoObjetivoProcesso = None                
except TypeError:
    DescricaoObjetivoProcesso = None

try:
    DescricaoIdentificacaoMateria = str(projects['VotacaoMateria']['Materia']['IdentificacaoMateria']['DescricaoIdentificacaoMateria'])
except KeyError:
    DescricaoIdentificacaoMateria = None                
except TypeError:
    DescricaoIdentificacaoMateria = None

try:
    IndicadorTramitando = str(projects['VotacaoMateria']['Materia']['IdentificacaoMateria']['IndicadorTramitando'])
except KeyError:
    IndicadorTramitando = None                
except TypeError:
    IndicadorTramitando = None

# This item (Votacoes) does not have the same number of items on each link, so I capture everything
try:
    Votacoes = str(projects['VotacaoMateria']['Materia']['Votacoes'])
except KeyError:
    Votacoes = None
except TypeError:
    Votacoes = None


dicionario = {"CodigoMateria": CodigoMateria,
        "SiglaCasaIdentificacaoMateria": SiglaCasaIdentificacaoMateria,
        "NomeCasaIdentificacaoMateria": NomeCasaIdentificacaoMateria,
        "SiglaSubtipoMateria": SiglaSubtipoMateria,
        "DescricaoSubtipoMateria": DescricaoSubtipoMateria,
        "NumeroMateria": NumeroMateria,
        "AnoMateria": AnoMateria,
        "DescricaoObjetivoProcesso": DescricaoObjetivoProcesso,
        "DescricaoIdentificacaoMateria": DescricaoIdentificacaoMateria,
        "IndicadorTramitando": IndicadorTramitando,
        "Votacoes": Votacoes
        }


projetos_vot.append(dicionario)

df_projetos_vot = pd.DataFrame(projetos_vot)

df_projetos_vot.reset_index()
df_projetos_vot.info()

<class 'pandas.core.frame.DataFrame'>                                    
RangeIndex: 1 entries, 0 to 0
Data columns (total 11 columns):
CodigoMateria                    1 non-null object
SiglaCasaIdentificacaoMateria    1 non-null object
NomeCasaIdentificacaoMateria     1 non-null object
SiglaSubtipoMateria              1 non-null object
DescricaoSubtipoMateria          1 non-null object
NumeroMateria                    1 non-null object
AnoMateria                       1 non-null object
DescricaoObjetivoProcesso        1 non-null object
DescricaoIdentificacaoMateria    1 non-null object
IndicadorTramitando              1 non-null object
Votacoes                         1 non-null object
dtypes: object(11)
memory usage: 216.0+ bytes

The item "Votacoes" then needs to be parsed. It looks like this:

{'Votacao': [{'CodigoSessaoVotacao': '3768', 'SessaoPlenaria': {'CodigoSessao': '23', 'SiglaCasaSessao': 'SF', 'NomeCasaSessao': 'Senado Federal', 'CodigoSessaoLegislativa': '9', 'SiglaTipoSessao': 'ORD', 'NumeroSessao': '6', 'DataSessao': '1995-02-22', 'HoraInicioSessao': '14:30:00'}, 'Tramitacao': {'IdentificacaoTramitacao': {'CodigoTramitacao': '269445', 'NumeroAutuacao': '1', 'DataTramitacao': '1995-02-22', 'NumeroOrdemTramitacao': '1', 'TextoTramitacao': 'VOTAÇÃO APROVADO O PROJETO.                               \n      ', 'IndicadorRecebimento': 'S', 'OrigemTramitacao': {'Local': {'CodigoLocal': '153', 'TipoLocal': 'A', 'SiglaCasaLocal': 'SF', 'NomeCasaLocal': 'Senado Federal', 'SiglaLocal': 'ATA-PLEN', 'NomeLocal': 'SUBSECRETARIA DE ATA - PLENÁRIO'}}, 'DestinoTramitacao': {'Local': {'CodigoLocal': '143', 'TipoLocal': 'A', 'SiglaCasaLocal': 'SF', 'NomeCasaLocal': 'Senado Federal', 'SiglaLocal': 'MESA', 'NomeLocal': 'MESA DIRETORA'}}}}, 'IndicadorVotacaoSecreta': 'Não', 'DescricaoVotacao': 'Projeto de Decreto Legislativo nº 39 de 1994', 'DescricaoResultado': 'Aprovado', 'Votos': {'VotoParlamentar': [{'IdentificacaoParlamentar': {'CodigoParlamentar': '59', 'NomeParlamentar': 'Marina Silva', 'NomeCompletoParlamentar': 'Maria Osmarina Marina Silva Vaz de Lima', 'SexoParlamentar': 'Feminino', 'FormaTratamento': 'Senadora ', 'UrlFotoParlamentar': 'http://www.senado.leg.br/senadores/img/fotos-oficiais/senador59.jpg', 'UrlPaginaParlamentar': 'http://www25.senado.leg.br/web/senadores/senador/-/perfil/59', 'EmailParlamentar': 'marinasi@senado.leg.br', 'SiglaPartidoParlamentar': 'PT', 'UfParlamentar': 'AC'}, 'SiglaVoto': 'Abstenção'},...

As I said in the script above, the item "Votacoes" can have a different structure on each link: the number of keys and the amount of data both vary.

Is there a more efficient way to parse this kind of information?

Also, is it better to organize it in a single dataframe, or to break it into multiple dataframes, each with a unique key from each link?

Edited on 12/20/2019

More details of the item "Votacoes" to try to further explain this question.

It contains information about parliamentary votes, including the votes of individual senators.

If you open any of the links in a browser such as Chrome, you will see more examples of how it is structured.

The data has many sublevels, with various keys and values. The number of keys may also vary from link to link.

This is different from the items in 'IdentificacaoMateria', which are simpler and have no sublevels, so it is easy to think of a dataframe structure for them.

1 - My question, then, is whether there is a way to read all the keys that exist in "Votacoes" and automate the creation of a dataframe

2 - Or whether I have to anticipate every possible key to capture the information and then build the dataframe

3 - Also, as this is a complex data structure, I would like an opinion on whether the conventional dataframe strategy is really the best one or whether something else would work better

For example, the current dataframe generated this file.

I thought of using the unique key of each proposition, "CodigoMateria", to index the dataframe. A search with the unique key would then return the dictionary contained in "Votacoes", and this dictionary would be used to show information in an application.

Edited on 12/21/2019

I followed the directions that @wowkin2 gave below and did this:

import requests
import pandas as pd
import collections.abc

# Function to read a value by its dotted key path
def get_by_key(key, value):
    try:
        if '.' in key:
            old_key, new_key = key.split('.', 1)
            new_value = value[old_key]
            return get_by_key(new_key, new_value)
        else:
            return value[key]
    except (KeyError, TypeError) as _:
        return None

# Function to flatten nested dictionaries
# (collections.MutableMapping was removed in Python 3.10; use collections.abc)
def flatten(d, parent_key='', sep='_'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

headers = {"Accept": "application/json"}

# df_projetos_det is a dataframe with multiple voting links,
# stored in the column "url_votacoes_materia"
df_projetos_det.info()

# Marks the beginning of the iteration in df_projetos_det
conta = 0

for num, row in df_projetos_det.iterrows():
    projetos_votos = []
    projects = {}

    url = row['url_votacoes_materia']
    print(url)

    try:
        r = requests.get(url, headers=headers)
        projects = r.json()
    except requests.exceptions.RequestException as e:
        print("Requests exception: {}".format(e))

    dicionario = {
        "CodigoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.CodigoMateria', projects),
        "SiglaCasaIdentificacaoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.SiglaCasaIdentificacaoMateria', projects),
        "NomeCasaIdentificacaoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.NomeCasaIdentificacaoMateria', projects),
        "SiglaSubtipoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.SiglaSubtipoMateria', projects),
        "DescricaoSubtipoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.DescricaoSubtipoMateria', projects),
        "NumeroMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.NumeroMateria', projects),
        "AnoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.AnoMateria', projects),
        "DescricaoObjetivoProcesso": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.DescricaoObjetivoProcesso', projects),
        "DescricaoIdentificacaoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.DescricaoIdentificacaoMateria', projects),
        "IndicadorTramitando": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.IndicadorTramitando', projects),
        "Votacoes": get_by_key('VotacaoMateria.Materia.Votacoes', projects),
    }

    projetos_votos.append(dicionario)

    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    if conta == 0:
        df_projetos_votos = pd.DataFrame(projetos_votos)
    else:
        df_projetos_votos_aux = pd.DataFrame(projetos_votos)
        df_projetos_votos = pd.concat([df_projetos_votos, df_projetos_votos_aux])

    conta = conta + 1

df_projetos_votos.info()


# Marks the beginning of the iteration in df_projetos_votos
conta = 0

for num, row in df_projetos_votos.iterrows():
    # I capture the unique code of the proposition that was voted or not
    CodigoMateria = row['CodigoMateria']
    Votacoes = row['Votacoes']

    # Tests if the proposition has already had a vote
    if Votacoes is not None:
        votos = flatten(Votacoes)

        df = pd.DataFrame(votos)
        # Add column with unique code
        df['CodigoMateria'] = CodigoMateria 

        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        if conta == 0:
            df_procura1 = df
        else:
            df_procura1 = pd.concat([df_procura1, df])

        conta = conta + 1

# Created a dataframe with the voting dictionary and its unique proposition codes
df_procura1.info()

What do you mean by " item (Votacoes) needs to be parsed"? Given your sample "Votacao", what's your desired output? — Jack Fleeting, Dec 20 '19 at 02:08
This code is painfully repetitive. You can condense the exception clauses: `except (KeyError, TypeError): foo = None`. If you're printing exception messages you can maintain a dict `err_dict = {ExceptionName: 'message'}` and access it: `except Error as e: print(err_dict.get(type(e), 'Unknown'))`... You can probably also do something a lot more succinct than these try-except waterfalls, but I'll let you think about that. — cs95, Dec 20 '19 at 06:57
Thank you very much @Jack Fleeting, I added more detail above — Reinaldo Chaves, Dec 20 '19 at 17:12
Now that I've looked at these pages, it seems to me that your problem is not so much extracting the data, but really how to present the extracted data. I don't think this is appropriate for a dataframe (or an Excel sheet). So you have to solve the presentation design issue first, I believe. — Jack Fleeting, Dec 20 '19 at 20:05
Thanks @Jack Fleeting. Yes, I have to think about the presentation. But I am inclined, so far, to resolve this by creating what I said in the last sentence: — Reinaldo Chaves, Dec 20 '19 at 21:30
Create an auxiliary dataframe with column "CodigoMateria" and all other columns that could appear in "Votacoes". Do you think there is a way to automate this? Or would it even foresee all column possibilities and fill them with data when they exist? — Reinaldo Chaves, Dec 20 '19 at 21:31
So, is all you want to make a flat structure from a multi-level dict for the dataframe? — wowkin2, Dec 20 '19 at 21:37
Yes @wowkin2, but as I wrote, it has many sublevels and not all data will be present (in all keys) — Reinaldo Chaves, Dec 20 '19 at 21:47

wowkin2 · Accepted Answer · 2019-12-20T22:13:31.843

If you want to flatten the dict structure and use it in a dataframe, you can use the example from a similar question about flattening nested dictionaries. The result is a dict that can be easily converted. If some fields are missing in a few objects, the dataframe will contain null values there.

import collections.abc

# collections.MutableMapping was removed in Python 3.10; use collections.abc
def flatten(d, parent_key='', sep='_'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

>>> flatten({'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y' : 10}}, 'd': [1, 2, 3]})
{'a': 1, 'c_a': 2, 'c_b_x': 5, 'c_b_y': 10, 'd': [1, 2, 3]}
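
Note that `flatten` leaves lists untouched (see `'d'` above), and in the "Votacoes" sample from the question both `Votacao` and `Votos.VotoParlamentar` are lists. Below is a minimal sketch of one possible way to turn them into one flat row per parliamentarian's vote. `as_list` and `votes_to_rows` are hypothetical helper names, the input is a trimmed-down copy of the sample in the question, and the idea that a lone item might arrive as a dict instead of a list is an assumption that has not been checked against the API.

```python
from collections.abc import MutableMapping


def flatten(d, parent_key='', sep='_'):
    """Flatten nested dicts; lists are kept as they are."""
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


def as_list(value):
    """Hypothetical helper: wrap a lone item in a list, map None to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def votes_to_rows(codigo_materia, votacoes):
    """Hypothetical helper: build one flat dict per parliamentarian's vote."""
    rows = []
    for votacao in as_list((votacoes or {}).get('Votacao')):
        # Session-level fields, leaving out the nested list of votes.
        session = flatten({k: v for k, v in votacao.items() if k != 'Votos'})
        votos = (votacao.get('Votos') or {}).get('VotoParlamentar')
        for voto in as_list(votos):
            row = {'CodigoMateria': codigo_materia}
            row.update(session)
            row.update(flatten(voto))
            rows.append(row)
    return rows


# Trimmed-down copy of the "Votacoes" sample shown in the question.
sample = {'Votacao': [{
    'CodigoSessaoVotacao': '3768',
    'SessaoPlenaria': {'DataSessao': '1995-02-22'},
    'DescricaoResultado': 'Aprovado',
    'Votos': {'VotoParlamentar': [
        {'IdentificacaoParlamentar': {'CodigoParlamentar': '59',
                                      'NomeParlamentar': 'Marina Silva'},
         'SiglaVoto': 'Abstenção'},
    ]},
}]}

# '0000' is an illustrative code, not the real CodigoMateria of this sample.
rows = votes_to_rows('0000', sample)
print(rows[0])
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)`; keys that are missing from some rows end up as null values. In this sketch a voting session without any recorded votes produces no rows at all, which may or may not be what you want.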


Originally (before the edit on 12/20/2019), I thought that you wanted to manually extract some keys and build the structure. So I suggested defining your structure with dotted paths like VotacaoMateria.Materia.IdentificacaoMateria.CodigoMateria to generate your dict for the pandas DataFrame:

import requests
import pandas as pd

headers = {"Accept": "application/json"}

# Example link
url = 'http://legis.senado.leg.br/dadosabertos/materia/votacoes/137178'

projetos_vot = []
projects = {}

try:
    r = requests.get(url, headers=headers)
    projects = r.json()
except requests.exceptions.RequestException as e:
    print("Requests exception: {}".format(e))


def get_by_key(key, value):
    try:
        if '.' in key:
            old_key, new_key = key.split('.', 1)
            new_value = value[old_key]
            return get_by_key(new_key, new_value)
        else:
            return value[key]
    except (KeyError, TypeError) as _:
        return None


dicionario = {
    "CodigoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.CodigoMateria', projects),
    "SiglaCasaIdentificacaoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.SiglaCasaIdentificacaoMateria', projects),
    "NomeCasaIdentificacaoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.NomeCasaIdentificacaoMateria', projects),
    "SiglaSubtipoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.SiglaSubtipoMateria', projects),
    "DescricaoSubtipoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.DescricaoSubtipoMateria', projects),
    "NumeroMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.NumeroMateria', projects),
    "AnoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.AnoMateria', projects),
    "DescricaoObjetivoProcesso": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.DescricaoObjetivoProcesso', projects),
    "DescricaoIdentificacaoMateria": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.DescricaoIdentificacaoMateria', projects),
    "IndicadorTramitando": get_by_key('VotacaoMateria.Materia.IdentificacaoMateria.IndicadorTramitando', projects),
    "Votacoes": get_by_key('VotacaoMateria.Materia.Votacoes', projects),
}


projetos_vot.append(dicionario)

df_projetos_vot = pd.DataFrame(projetos_vot)

df_projetos_vot.reset_index()
df_projetos_vot.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1 entries, 0 to 0
# Data columns (total 11 columns):
# AnoMateria                       1 non-null object
# CodigoMateria                    1 non-null object
# DescricaoIdentificacaoMateria    1 non-null object
# DescricaoObjetivoProcesso        1 non-null object
# DescricaoSubtipoMateria          1 non-null object
# IndicadorTramitando              1 non-null object
# NomeCasaIdentificacaoMateria     1 non-null object
# NumeroMateria                    1 non-null object
# SiglaCasaIdentificacaoMateria    1 non-null object
# SiglaSubtipoMateria              1 non-null object
# Votacoes                         1 non-null object
# dtypes: object(11)
# memory usage: 160.0+ bytes

print(df_projetos_vot.head())

#   AnoMateria CodigoMateria DescricaoIdentificacaoMateria DescricaoObjetivoProcesso  ... NumeroMateria SiglaCasaIdentificacaoMateria SiglaSubtipoMateria                                           Votacoes
# 0       2019        137178                   PEC 91/2019                  Revisora  ...         00091                            SF                 PEC  {u'Votacao': [{u'DescricaoVotacao': u'Proposta...
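
The `get_by_key` calls above all share the same path prefix, so the dictionary could also be built from a list of field names instead of being written out entry by entry. This is a minimal sketch under the same JSON layout; `PREFIX`, `FIELDS` and `build_record` are hypothetical names, and the `projects` value is a tiny hand-made stand-in for the API response rather than real data.

```python
def get_by_key(key, value):
    """Walk a dotted path through nested dicts; return None if a step fails."""
    try:
        if '.' in key:
            old_key, new_key = key.split('.', 1)
            return get_by_key(new_key, value[old_key])
        return value[key]
    except (KeyError, TypeError):
        return None


# Hypothetical constants: the shared path prefix and the fields to extract.
PREFIX = 'VotacaoMateria.Materia'
FIELDS = [
    'CodigoMateria',
    'SiglaCasaIdentificacaoMateria',
    'NomeCasaIdentificacaoMateria',
    'SiglaSubtipoMateria',
    'DescricaoSubtipoMateria',
    'NumeroMateria',
    'AnoMateria',
    'DescricaoObjetivoProcesso',
    'DescricaoIdentificacaoMateria',
    'IndicadorTramitando',
]


def build_record(projects):
    """Hypothetical helper: collect the identification fields plus Votacoes."""
    record = {
        field: get_by_key(PREFIX + '.IdentificacaoMateria.' + field, projects)
        for field in FIELDS
    }
    record['Votacoes'] = get_by_key(PREFIX + '.Votacoes', projects)
    return record


# Hand-made stand-in for r.json(); most fields are deliberately missing.
projects = {'VotacaoMateria': {'Materia': {'IdentificacaoMateria': {
    'CodigoMateria': '137178', 'AnoMateria': '2019'}}}}

record = build_record(projects)
print(record['CodigoMateria'], record['NumeroMateria'], record['Votacoes'])
```

Fields that are absent from the response simply come back as `None`, so adding or removing a column only means editing the list.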

Thank you very much @wowkin2 I edited my question to better explain my need — Reinaldo Chaves, Dec 20 '19 at 21:33
@ReinaldoChaves updated answer - please have a look and let me know if that's it :) — wowkin2, Dec 20 '19 at 22:13
Thanks so much @wowkin2 I edited my question above and put the solution I found. If you have more suggestions let me know — Reinaldo Chaves, Dec 21 '19 at 17:18
@ReinaldoChaves Looks like my answer was useful, so you combined my examples together and have the result. Please mark my answer accepted, if you don’t have anything remaining for this question. And you are always welcome to create new questions. — wowkin2, Dec 21 '19 at 18:39
