
I'm trying to extract data from a CSV into a JSON file. The CSV has several columns, but I only need col1, col2, and col3. I have been playing around with pandas but can't figure out how to eliminate the other columns and keep just col1, col2, and col3. I know that iterrows goes through all the rows, which is why I'm getting all the columns; I tried iloc but couldn't get the proper output.

My code so far:

import pandas as pd
import pdb
from itertools import groupby
from collections import OrderedDict
import json

df = pd.read_csv('test_old.csv', dtype={
            "col1" : str,
            "col2" : str
        })

results = []

for (col1), bag in df.groupby(["col1"]):
    contents_df = bag.drop(["col1"], axis=1)
    labels = [OrderedDict(row) for i,row in contents_df.iterrows()]
    pdb.set_trace()
    results.append(OrderedDict([("col1", col1),
                                ("subset", labels)]))

print(json.dumps(results[0], indent=4))
with open('ExpectedJsonFile.json', 'w') as outfile:
    outfile.write(json.dumps(results, indent=4))

The CSV

col1,col2,state,col3,val2,val3,val4,val5
95110,2015-05-01,CA,50,30.00,5.00,3.00,3
95110,2015-06-01,CA,67,31.00,5.00,3.00,4
95110,2015-07-01,CA,97,32.00,5.00,3.00,6

The expected JSON:

{
    "col1": "95110",
    "subset": [
        {
            "col2": "2015-05-01",
            "col3": "50"
        },
        {
            "col2": "2015-06-01",
            "col3": "67"
        },
        {
            "col2": "2015-07-01",
            "col3": "97"
        }
    ]
}
RinW

2 Answers


To keep only the desired columns, try this:

cols_to_keep = ['col1', 'col2', 'col3']
df = df[cols_to_keep]
df

You can also read in only the columns you need, like this:

df = pd.read_csv('test_old.csv', usecols = ['col1', 'col2', 'col3'],   
                  dtype={"col1" : str, "col2" : str})
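Putting the two together with the groupby from the question, a minimal end-to-end sketch (the sample CSV rows from the question are inlined via `StringIO` here so the snippet is self-contained; in practice you would pass `'test_old.csv'` instead):

```python
import json
from io import StringIO
from collections import OrderedDict

import pandas as pd

# Sample data from the question, standing in for test_old.csv
csv_data = StringIO(
    "col1,col2,state,col3,val2,val3,val4,val5\n"
    "95110,2015-05-01,CA,50,30.00,5.00,3.00,3\n"
    "95110,2015-06-01,CA,67,31.00,5.00,3.00,4\n"
    "95110,2015-07-01,CA,97,32.00,5.00,3.00,6\n"
)

# usecols drops every other column at read time; dtype=str keeps "50" a string
df = pd.read_csv(csv_data, usecols=['col1', 'col2', 'col3'],
                 dtype={'col1': str, 'col2': str, 'col3': str})

results = []
for col1, bag in df.groupby('col1'):
    # Each row of the group becomes one {"col2": ..., "col3": ...} entry
    labels = [OrderedDict(row) for _, row in bag.drop('col1', axis=1).iterrows()]
    results.append(OrderedDict([('col1', col1), ('subset', labels)]))

print(json.dumps(results, indent=4))
```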
dernk
  • Thanks, I just tried it out; your implementation is quite simple and requires fewer modifications to the current code as well. Works! – RinW Sep 30 '18 at 22:56

You can do all the grouping in pandas.

The idea behind this solution:

1. Create a new column subset that holds the subset dictionary you want.
2. Group the dataframe by col1 into a new data frame, so the subset is connected to each item from col1, and extract the series subset.
3. Loop through this series and collect the data for your JSON in a list.
4. Convert that list to JSON with Python's native tools.

import pandas as pd
import json

df = pd.read_csv('test_old.csv', sep=',',
       dtype={
        "col1" : str,
        "col2" : str,
        "col3" : str
    })

# print(df) - compare with example

df['subset'] = df.apply(lambda x: 
                 {'col2': x.col2,
                  'col3': x.col3 }, axis=1)

s = df.groupby('col1').agg(lambda x: list(x))['subset']

results = []

for col1, subset in s.iteritems():
    results.append({'col1': col1, 'subset': subset})

with open('ExpectedJsonFile.json', 'w') as outfile:
    outfile.write(json.dumps(results, indent=4))

UPDATE: Since there's a problem with the example, insert a print(df) line right after the pd.read_csv call and compare.

The imported data frame should look like this:

    col1        col2 state col3  val2  val3  val4  val5
0  95110  2015-05-01    CA   50  30.0   5.0   3.0     3
1  95110  2015-06-01    CA   67  31.0   5.0   3.0     4
2  95110  2015-07-01    CA   97  32.0   5.0   3.0     6

The final result looks like this:

[
    {
        "col1": "95110",
        "subset": [
            {
                "col2": "2015-05-01",
                "col3": "50"
            },
            {
                "col2": "2015-06-01",
                "col3": "67"
            },
            {
                "col2": "2015-07-01",
                "col3": "97"
            }
        ]
    }
]

Tested with Python 3.5.6 (32-bit), pandas 0.23.4, Windows 7.
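A variant of the same grouping, sketched with `DataFrame.to_dict('records')` instead of the row-wise `apply` (this is a named alternative, not the answer's own code; the sample CSV from the question is inlined via `StringIO` to keep the snippet runnable):

```python
import json
from io import StringIO

import pandas as pd

csv_data = StringIO(
    "col1,col2,state,col3,val2,val3,val4,val5\n"
    "95110,2015-05-01,CA,50,30.00,5.00,3.00,3\n"
    "95110,2015-06-01,CA,67,31.00,5.00,3.00,4\n"
    "95110,2015-07-01,CA,97,32.00,5.00,3.00,6\n"
)
df = pd.read_csv(csv_data, dtype={'col1': str, 'col2': str, 'col3': str})

results = [
    # to_dict('records') turns each group's selected rows into a list of dicts
    {'col1': col1, 'subset': group[['col2', 'col3']].to_dict('records')}
    for col1, group in df.groupby('col1')
]

print(json.dumps(results, indent=4))
```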

576i
  • Hi thanks, I tried it and for some reason the output doesn't come out as expected `{ "subset": [ { "col2": "5.00", "col3": "3" }, { "col2": "5.00", "col3": "4" }, { "col2": "5.00", "col3": "6" } ], "col1": "95110" }` – RinW Sep 30 '18 at 13:08
  • It looks to me like your csv import fails. I used your example from the clipboard, with the `pd.read_clipboard()` command, which is the same apart from the filename. So you should check whether, after reading, df really contains the columns you expect. Or for a test, copy your example to the clipboard and replace the `read_csv` command. – 576i Sep 30 '18 at 13:24
  • Yes, I validated this by saving your example and then running the code. My json export looks fine, so please check your example. – 576i Sep 30 '18 at 13:30
  • Hi, I just tried it with the print and the debugging. Strangely, the only issue with my output is that the value for "col1" comes after "subset"; it's not a huge deal breaker since it's in the same format. I'm on Python 2.7 and Mac, which could be why it changes. Thanks :) – RinW Sep 30 '18 at 22:54
  • 1
    If you are on the old Python 2.7, use OrderedDict https://stackoverflow.com/questions/10844064/items-in-json-object-are-out-of-order-using-json-dumps when you do `results.append(...`. In case you are not aware of it, Python2.7's end of life is 2020 and Pandas will drop Python 2.7 support on January 1st, 2019, so if you can, move to Python3 now. – 576i Oct 01 '18 at 08:53
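On Python 2.7 a plain dict does not preserve insertion order, so json.dumps may emit "subset" before "col1". A minimal sketch of the OrderedDict fix suggested in the comment above:

```python
import json
from collections import OrderedDict

# A regular dict has arbitrary key order on Python 2.7;
# an OrderedDict serializes its keys in insertion order on any version.
row = OrderedDict([('col1', '95110'),
                   ('subset', [{'col2': '2015-05-01', 'col3': '50'}])])

s = json.dumps(row, indent=4)
# "col1" is now guaranteed to appear before "subset" in the output
```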