Pandas table scrape

Question

I am trying to figure the best approach of converting a table into JSON records. At present I have the output as desired however the format of the table is puzzling me a little. The example below should explain:

ID   Product        Item_Material   Owner           Interest %
123  Test Item 1    Electric        Elctrotech              60%
null null           null            Spark inc               40%
124  Test Item 2    Wood            TY Toys                 100%
125  Test Item 3    Plastic         NA Materials            100%

My new line JSON is what I want but I am looking to somehow achieve the nested table rows into a nested JSON format if part of the parent row.

{"ID":"Test Item 1", "Item_Material":"Electric", "Owner":"Elctrotech","Interest %":"60%"}
{"ID":null, "Item_Material":null, "Owner":"Spark inc","Insterest %":"40%"} 
{"ID":"Test Item 2", "Item_Material":"Wood", "Owner":"TY Toys","Insterest %":"100%"}
{"ID":"Test Item 3","Item_Material":"Plastic","Owner":"NA Materials","Interest %":"100%"}

The aim would be to have the first row JSON something like this?

{"ID":"Test Item 1", "Item_Material":"Electric", "Owners": [{"Owner": "Elctrotech", "Interest %":"60%", "Owner":"Spark inc","Interest %":"40%"}]}

The data originates from a scraped table using Beautiful Soup, the rows in the table I have provided are all in separate <tr> tags so when pulled into a pandas dataframe it is presented this way. I dont know if there is functionality to even merge in pandas to the row above so I can have one JSON record per 'Product'. Sometimes there can be multiple 'Owners' per item not just 2.

`JSON` can't have `Owner` and `Interest` two times in one `{}` — furas, Dec 18 '17 at 14:54
if you had correct values instead of `null` in columns `Product` and `Item_Material` tthen you could atleast group elements using `group_by()` and maybe then you could easier save it as JSON. — furas, Dec 18 '17 at 14:57
I noticed the `{}` error thanks. Yeah I just thought of that approach, you know of a way to replace any `null` value with the previous row value? I reckon that would work to allow `groupby` to tidy it up — Chris, Dec 18 '17 at 15:01
this works with `NaN` data but may not works with `null` - [How to replace NaNs by preceding values in pandas DataFrame?](https://stackoverflow.com/questions/27905295/how-to-replace-nans-by-preceding-values-in-pandas-dataframe).But you can replace `null` with `NaN` first. — furas, Dec 18 '17 at 15:04
Thanks @furas think I'll manage to figure it out now! Just needed a moment of clarity. — Chris, Dec 18 '17 at 15:29

score 0 · Answer 1 · answered Dec 18 '17 at 15:41

The output dict line is not the same that you expected, but your dict sintax was wrong. Try this. Only with Pandas

p=[[123,"Test Item 1","Electric","Elctrotech","60%"], [124,"Test Item 2","Wood"," TY Toys","100%"],[125,"Test Item 1","Plastic","NA Materials","100%"], [123,"Test Item 1","Foo","Bar","80%"], [123,"Test Item 1","Electric","TRY TRY TRY","70%"]]

x=pd.DataFrame(p, columns=["ID","Product","Item_Material","Owner","Interest %"])

d=dict(ID="", Item_Material="", Owners={"Owner":[], "Interest %":[]})
x_gb=x.groupby(["Product", "Item_Material"])
grouped_Series_Owner = x_gb["Owner"].apply(list).to_dict()
grouped_Series_Interest = x_gb["Interest %"].apply(list).to_dict()
for k in out.keys():
    d["Item_Material"]=out[k]["Item_Material"]
    d["ID"]=out[k]["Product"]
    d["Owners"]["Owner"]= grouped_Series_Owner[(out[k]["Product"], out[k]["Item_Material"])]
    d["Owners"]["Interest %"]= grouped_Series_Interest[(out[k]["Product"], out[k]["Item_Material"])]
    print(d)

Pandas table scrape

1 Answers1

Linked