I'm trying to extract data from a CSV into a JSON file. The CSV has several columns, but I need only col1, col2, and col3. I've been experimenting with pandas, but I can't figure out how to drop the other columns and keep just col1, col2, and col3. I know that iterating with iterrows goes through every row and picks up all of the columns, which is why I'm getting everything; I tried iloc but couldn't get the proper output.
My code so far:
import pandas as pd
import pdb
from collections import OrderedDict
import json

df = pd.read_csv('test_old.csv', dtype={
    "col1": str,
    "col2": str
})

results = []
for col1, bag in df.groupby("col1"):
    contents_df = bag.drop("col1", axis=1)
    labels = [OrderedDict(row) for i, row in contents_df.iterrows()]
    pdb.set_trace()
    results.append(OrderedDict([("col1", col1),
                                ("subset", labels)]))

print(json.dumps(results[0], indent=4))

with open('ExpectedJsonFile.json', 'w') as outfile:
    outfile.write(json.dumps(results, indent=4))
The CSV:
col1,col2,state,col3,val2,val3,val4,val5
95110,2015-05-01,CA,50,30.00,5.00,3.00,3
95110,2015-06-01,CA,67,31.00,5.00,3.00,4
95110,2015-07-01,CA,97,32.00,5.00,3.00,6
The expected JSON:
{
    "col1": "95110",
    "subset": [
        {
            "col2": "2015-05-01",
            "col3": "50"
        },
        {
            "col2": "2015-06-01",
            "col3": "67"
        },
        {
            "col2": "2015-07-01",
            "col3": "97"
        }
    ]
}
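For reference, one way to get this shape is to restrict the frame to the needed columns at read time (`read_csv`'s `usecols` parameter) instead of dropping columns inside the loop. This is a minimal sketch, not necessarily the only approach; the CSV from the question is inlined via `io.StringIO` only to keep the example self-contained — with a real file you would pass the filename as before:

```python
import io
import json

import pandas as pd

# The sample CSV from the question, inlined so the sketch runs standalone.
csv_text = """col1,col2,state,col3,val2,val3,val4,val5
95110,2015-05-01,CA,50,30.00,5.00,3.00,3
95110,2015-06-01,CA,67,31.00,5.00,3.00,4
95110,2015-07-01,CA,97,32.00,5.00,3.00,6
"""

# usecols keeps only the three needed columns; dtype=str matches the
# string values shown in the expected JSON.
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=["col1", "col2", "col3"],
                 dtype=str)

results = []
for col1, bag in df.groupby("col1"):
    # to_dict(orient="records") turns each row into a plain dict,
    # so no per-row iterrows loop is needed.
    records = bag[["col2", "col3"]].to_dict(orient="records")
    results.append({"col1": col1, "subset": records})

print(json.dumps(results, indent=4))
```

Because `usecols` filters columns during parsing, the unwanted columns never enter the DataFrame, so there is nothing to drop later.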