
I've been trying to solve this for a while and have come to the conclusion that I just need to ask. I have a table that I extracted from a PDF using Tabula and put into a DataFrame. The ultimate goal is to convert the extracted table to JSON, so I end up converting the DataFrame into a dictionary. The trouble is that the table spans two pages: when I iterate over each row of the DataFrame, the very first row is taken as column headers rather than data, and the very first "Account" value is then treated as a key for every single row. I pasted the output below; as you can see, the first value under the Account column is being used as a key in every dictionary as it iterates through each row.
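As far as I can tell, the second page comes back looking something like this minimal reproduction (the values here are guesses based on the output below, not my real data):

import pandas as pd

# the first data row of the second page gets promoted to column headers,
# so "XYZ" and "$12.0" end up as column names instead of data
page_two = pd.DataFrame([["ABC", "$3.0"], ["BCD", "$5.5"]], columns=["XYZ", "$12.0"])
print(page_two)
#    XYZ $12.0
# 0  ABC  $3.0
# 1  BCD  $5.5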

Instead, what I need is this structure { Account : Allocation }, so that if the first row of my DataFrame shows "XYZ" under the Account column and "$12.0" under the Allocation column, it should show up as {"XYZ" : "$12.0"} instead.

import re
import pandas as pd

json_data_list = []
for df in dfs_list:
    for index, row in df.iterrows():
        try:
            print(f"Row {index}: {row}")
            account_name = re.match(r'^[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*', row['Account']).group(0).strip()
            allocation_value = re.match(r'\$\d+(?:\.\d+)?', row['Allocation ($mm)']).group(0)
            print(f"Allocation value: {allocation_value}")

        except:
            print(row['Account'], "error")
        if pd.notnull(account_name) and pd.notnull(allocation_value):
            json_dict = {account_name: allocation_value}
            json_data_list.append(json_dict)
            print(f"Adding dictionary: {json_dict}")
This is what I get as an output:
[
   {
      "0": {
         "1": 2,
         "XYZ": "ABC",
         "$12.0": "$3.0"
      },
      "1": {
         "1": 3,
         "XYZ": "BCD",
         "$12.0": "$5.5"
      },

1 Answer


Overall it's probably much better to consider how you can work with whole columns at once, rather than iterating over the rows and matching a regex against each value individually.

Example DataFrame

>>> df = pd.DataFrame({"Account": ["ABC", "BCD", "XYZ", "BAD"], "Allocation ($mm)": ["$3.0", "$5.5", "$12.0", "xxx"]})
>>> df
  Account Allocation ($mm)
0     ABC             $3.0
1     BCD             $5.5
2     XYZ            $12.0
3     BAD              xxx

Data Filtering Example

>>> df["Allocation ($mm)"].str.extract('\$(\d+(?:\.\d+)?)', expand=False).astype("float64")
0     3.0
1     5.5
2    12.0
3     NaN
Name: Allocation ($mm), dtype: float64
>>> df["Allocation"] = df["Allocation ($mm)"].str.extract('\$(\d+(?:\.\d+)?)', expand=False).astype("float64")
>>> df
  Account Allocation ($mm)  Allocation
0     ABC             $3.0         3.0
1     BCD             $5.5         5.5
2     XYZ            $12.0        12.0
3     BAD              xxx         NaN
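If the "Account" column also carries stray text from the PDF extraction (an assumption; the example values here are already clean), the same column-wise approach works there too, using a pattern similar to the one in the question, widened to allow all-caps names like "XYZ":

>>> df["Account"] = df["Account"].str.extract(r'^([A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*)', expand=False).str.strip()

Here the values are unchanged because the example data is already clean, but any row that doesn't match the pattern becomes NaN and would then be dropped by the .notna() filtering shown further down.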

Simple conversion to JSON

  • take only the "Account" and "Allocation" columns
  • call .to_json() directly with orient="values"
>>> df[["Account", "Allocation"]].to_json(orient="values")
'[["ABC",3.0],["BCD",5.5],["XYZ",12.0],["BAD",null]]'

More complex coercion to dict
(note this will only work nicely for 2 columns; use .to_dict() to handle even more complex variants)

  • skip NaN values in Allocation
  • take only the "Account" and "Allocation" columns
  • convert to 2-column (NumPy) array
  • convert to dict (conversion of pairs to dict is easy and fast)
>>> dict(df[df["Allocation"].notna()][["Account", "Allocation"]].to_numpy())
{'ABC': 3.0, 'BCD': 5.5, 'XYZ': 12.0}
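Since the question mentions the table spans two pages, dfs_list presumably holds one DataFrame per page (that name comes from the question); assuming every page was read with the same, correct column names, you can concatenate the pages first and do the cleanup and conversion once, roughly like this:

combined = pd.concat(dfs_list, ignore_index=True)  # stack the per-page DataFrames
combined["Allocation"] = combined["Allocation ($mm)"].str.extract(r'\$(\d+(?:\.\d+)?)', expand=False).astype("float64")
allocations = dict(combined[combined["Allocation"].notna()][["Account", "Allocation"]].to_numpy())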

Additionally, if you're iterating in a loop, you can use continue to short-circuit the logic flow and go to the next value (a row, in your case).

for foo in bar:
    value = None  # wasteful, but safer
    try:
        value = foo["some key which might be in foo"]
    except KeyError:  # specific Exception
        print(f"KeyError on with {foo}")
    except Exception as ex:  # generic Exception
        print(f"unexpected Exception: {repr(ex)}")
    if value is None:
        continue  # --> next foo in bar
    # rest of logic which makes use of value
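Applied to the loop from the question, that pattern might look roughly like this (a sketch; the regexes and column names come from the question, and since re.match simply returns None when nothing matches, you can skip the try/except entirely here):

for df in dfs_list:
    for index, row in df.iterrows():
        account_match = re.match(r'^[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*', str(row['Account']))
        allocation_match = re.match(r'\$\d+(?:\.\d+)?', str(row['Allocation ($mm)']))
        if account_match is None or allocation_match is None:
            print(f"skipping row {index}: {row['Account']!r}")
            continue  # --> next row
        json_data_list.append({account_match.group(0).strip(): allocation_match.group(0)})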