2

I have a dataframe such that the column contains both json objects and strings. I want to get rid of rows that does not contains json objects.

Below is how my dataframe looks like :

import pandas as pd

df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"a":9,"b":10,"c":11}]})

print(df)

How should i remove the rows that contains only strings, so that after removing those string rows, I can apply below to this column to convert json object into separate columns of dataframe:

from pandas.io.json import json_normalize
df = json_normalize(df['A'])
print(df)
Nikita Gupta
  • 495
  • 9
  • 24
  • It's not json once you've made your df, it's a dict. But it's keeping me occupied trying to selectively keep those columns for sure :) – roganjosh Oct 20 '17 at 19:56
  • Yes , by json I meant dict object only. Any idea how do i remove all the rows that contain simple strings like "hello", "world" etc – Nikita Gupta Oct 20 '17 at 19:57
  • Kindly went to this question https://stackoverflow.com/questions/46856988/np-isreal-behavior-different-in-pandas-dataframe-and-numpy-array – BENY Oct 20 '17 at 21:45

3 Answers3

3

I think I would prefer to use an isinstance check:

In [11]: df.loc[df.A.apply(lambda d: isinstance(d, dict))]
Out[11]:
                            A
2    {'a': 5, 'b': 6, 'c': 8}
5  {'d': 9, 'e': 10, 'f': 11}

If you want to include numbers too, you can do:

In [12]: df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]
Out[12]:
                            A
2    {'a': 5, 'b': 6, 'c': 8}
5  {'d': 9, 'e': 10, 'f': 11}

Adjust this to whichever types you want to include...


The last step, json_normalize takes a list of json objects, for whatever reason a Series is no good (and gives the KeyError), you can make this a list and your good to go:

In [21]: df1 = df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]

In [22]: json_normalize(list(df1["A"]))
Out[22]:
     a    b    c    d     e     f
0  5.0  6.0  8.0  NaN   NaN   NaN
1  NaN  NaN  NaN  9.0  10.0  11.0
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • I much prefer this answer. Since the other discussion seems to be going nowhere, do you happen to know why `isreal` works so you can point me in the right direction in reading? – roganjosh Oct 20 '17 at 20:18
  • on applying "normalize code" after appying your code, it gives key error. – Nikita Gupta Oct 20 '17 at 20:22
  • 1
    @roganjosh I have no idea, I think you'd need to look at the code - I don't think np.isreal is intended to be used like that (I wouldn't like to rely on it) – Andy Hayden Oct 20 '17 at 20:32
  • 1
    @NikitaGupta added last step to avoid the KeyError (you need to pass json_normalize a list rather than a Series) – Andy Hayden Oct 20 '17 at 20:36
  • @AndyHayden , I'm reading this dataframe from a csv, so the column contains both json data and string data, and when I'm applying your code, I'm getting a dataframe with 0 rows in it. Is it like as I have read it from csv, so dict object are treated as the same as that of strings in the same column? – Nikita Gupta Oct 20 '17 at 22:36
  • @NikitaGupta do `isinstance(d, (dict, np.number, basestring))` or `isinstance(d, (dict, np.number, str))` depending on your python version – Andy Hayden Oct 20 '17 at 22:38
  • @AndyHayden my python is 3.6 – Nikita Gupta Oct 20 '17 at 22:39
  • @NikitaGupta Might be good idea to ask another question if it's how to get a csv into DataFrame which has dict objects? (if that is the issue). – Andy Hayden Oct 20 '17 at 22:39
  • @AndyHayden, that column contains data in both forms dict as well as strings, and when i check datatype of this column after reading from csv and storing in dataframe , the data type is object – Nikita Gupta Oct 20 '17 at 22:41
  • @NikitaGupta verify the things you think are dicts are actually dicts (e.g. by looking at a single element), I claim they may be strings like `'{}'` or `'{"hello": "world"}'` rather than an actual dict like `{}` or `{"hello": "world"}`. – Andy Hayden Oct 20 '17 at 22:44
  • @AndyHayden, ok so the type is string. but it's not like this in dataframe '{"hello": "world"}' or csv, instead on both the places its without quotes – Nikita Gupta Oct 20 '17 at 22:50
  • @NikitaGupta I have a good answer (to use `ast.literal_eval`, please ask another question, I think doesn't really belongs on this one). – Andy Hayden Oct 21 '17 at 00:01
  • @AndyHayden, I have asked this question : https://stackoverflow.com/questions/46858848/dict-objects-converting-to-string-when-read-from-csv-to-dataframe-pandas-python – Nikita Gupta Oct 21 '17 at 00:09
1
df[df.applymap(np.isreal).sum(1).gt(0)]
Out[794]: 
                            A
2    {'a': 5, 'b': 6, 'c': 8}
5  {'d': 9, 'e': 10, 'f': 11}
BENY
  • 317,841
  • 20
  • 164
  • 234
  • 1
    Please explain, what its exactly doing – Nikita Gupta Oct 20 '17 at 20:01
  • I'm also confused by what this does. The docs don't offer much, certainly for strings. Is this a side effect? – roganjosh Oct 20 '17 at 20:03
  • `df[df.applymap(np.isreal).values]` may be a bit more concise. – cmaher Oct 20 '17 at 20:04
  • np.isreal: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.isreal.html – BENY Oct 20 '17 at 20:06
  • That doesn't explain what happens here at all though unless you have some external knowledge. I have read the docs. It doesn't explain why it should filter dicts from strings. – roganjosh Oct 20 '17 at 20:07
  • ok but when I'm applying . this code from pandas.io.json import json_normalize df = json_normalize(df['A']) print(df) to your result, it's throwing error – Nikita Gupta Oct 20 '17 at 20:08
  • So.. a dict is real, a string is not? ...the docs don't mention that. – Andy Hayden Oct 20 '17 at 20:10
  • @Wen what might be obvious to you certainly isn't to others. Neither a dict or a string has a complex part... because it's not a complex number. I'm not asking for you to post a tutorial on this, but certainly something more than the docs that explains what's happening because I'm clueless on why this works. – roganjosh Oct 20 '17 at 20:10
  • 1
    @roganjosh seems like just passed wrong type of dput , Will open a question about it, Sorry for the late reply , just done with meeting ... – BENY Oct 20 '17 at 20:26
  • The question/answer is [here](https://stackoverflow.com/questions/46856988/np-isreal-behavior-different-in-pandas-dataframe-and-numpy-array?noredirect=1#46857112). – roganjosh Oct 20 '17 at 20:58
0

If you want an ugly solution that also works...here's a function I created that finds columns that contain only strings, and returns the df minus those rows. (since your df has only one column, you'll just dataframe containing 1 column with all dicts). Then, from there, you'll want to use df = json_normalize(df['A'].values) instead of just df = json_normalize(df['A']).

For a single column dataframe...

import pandas as pd
import numpy as np
from pandas.io.json import json_normalize
def delete_strings(df):
    nrows = df.shape[0]
    rows_to_keep = []
    for row in np.arange(nrows):
        if type(df.iloc[row,0]) == dict:
            rows_to_keep.append(row) #add the row number to list of rows 
                                     #to keep if the row contains a dict
    return df.iloc[rows_to_keep,0] #return only rows with dicts
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",
                         {"a":9,"b":10,"c":11}]})
df = delete_strings(df)
df = json_normalize(df['A'].values)
print(df)
#0      {'a': 5, 'b': 6, 'c': 8}
#1    {'a': 9, 'b': 10, 'c': 11}   

For a multi-column df (also works with a single column df):

def delete_rows_of_strings(df):
    rows = df.shape[0] #of rows in df
    cols = df.shape[1] #of coluns in df
    rows_to_keep = [] #list to track rows to keep
    for row in np.arange(rows): #for every row in the dataframe
        #num_string will count the number of strings in the row
        num_string = 0
        for col in np.arange(cols):  #for each column in the row...
            #if the value is a string, add one to num_string
            if type(df.iloc[row,col]) == str:
                num_string += 1
        #if num_string, the number of strings in the column,
        #isn't equal to the number of columns in the row...
        if num_string != cols: #...add that row number to the list of rows to keep
            rows_to_keep.append(row)
    #return the df with rows containing at least one non string
    return(df.iloc[rows_to_keep,:])


df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india"],
                        'B' : ['hi',{"a":5,"b":6,"c":8},'sup','america','china']})
#                          A                         B
#0                     hello                        hi
#1                     world  {'a': 5, 'b': 6, 'c': 8}
#2  {'a': 5, 'b': 6, 'c': 8}                       sup
print(delete_rows_of_strings(df))
#                          A                         B
#1                     world  {'a': 5, 'b': 6, 'c': 8}
#2  {'a': 5, 'b': 6, 'c': 8}                       sup
Dave Rosenman
  • 1,252
  • 9
  • 13