
I'm trying to extract a field from a JSON response that contains a list, then append that list to a DataFrame, but I'm running into a few different errors.

I think I could write it to a CSV and then read the CSV with pandas, but I'm trying to avoid writing any files. I know I can also use StringIO to build a CSV in memory, but that has issues with null bytes. Replacing those would (I think) be another line-by-line step that would further extend the script's runtime. I'm running this against a query that returns tens of thousands of results, so keeping it fast and simple is a priority.

First I tried this:

hit_json = json.loads(hit)
for ln in hit_json.get('hits').get('hits'):
    df = df.append(ln['_source'], ignore_index=True)
print(df)

This gives me a result that looks like this:

   1    2           3    4
   a    b    d,e,f...    x

Then I tried this:

 df = df.append(ln['_source']['payload'], ignore_index=True)

But that gives me this error:

TypeError: cannot concatenate object of type "<class 'str'>"; only pd.Series, 
pd.DataFrame, and pd.Panel (deprecated) objs are valid

What I'm looking for would be something like this:

0  1  2  3  4
d  e  f  g  h

On top of this, I need to figure out a way to handle a specific string in this list that contains a comma, which may be a headache best handled in a different question. Something like:

# Obviously this is incorrect but I think you get the idea :)
str.replace(',', '^')
    except if ',' followed by ' '
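For what it's worth, a rule like "replace commas unless followed by a space" can be sketched with a regex negative lookahead (the `^` replacement character is just the placeholder from the pseudocode above):

```python
import re

def replace_unspaced_commas(text):
    # Replace each ',' only when it is NOT immediately followed by a space.
    return re.sub(r',(?! )', '^', text)

print(replace_unspaced_commas('a,b, c,d'))  # a^b, c^d
```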

Greatly appreciate any help!

EDITING TO ADD JSON AS REQUESTED

{
"_index": "sanitized",
"_type": "sanitized",
"_id": "sanitized",
"_score": sanitized,
"_source": {
    "sanitized": sanitized,
    "sanitized": "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,\"34,35\",36,37,38,39,40",
    "sanitized": "sanitized",
    "sanitized": ["sanitized"],
    "sanitized": "sanitized",
    "sanitized": "sanitized",
    "sanitized": "sanitized",
    "sanitized": "sanitized"
}
}
r1ty

2 Answers


You can maybe write a temporary file with StringIO, like it's done here.

Then for the second part you could do

if ',' in data and ', ' not in data:
    data = data.replace(',', '^')
  • I've tried this as well... it's slow and has issues with null bytes; if there's a null byte, it just completely fails. The fix would be to replace the null bytes, but that's another step that goes line by line. This is definitely a possible solution, though I'm hoping I can get it to work without StringIO – r1ty Jun 19 '18 at 20:48

You can try the following

hit_json = json.loads(hit)
for ln in hit_json.get('hits').get('hits'):
    data = ln['_source']["payload"].split(",")
    df.loc[len(df)] = pd.Series(data, index=range(len(data)))
print(df)

The benefit of loc is that you will not create a new dataframe each time, so it will be fast. You can find the post here.
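Note that a plain split(",") would break on the quoted field in the sample payload (the \"34,35\" part). If that matters, Python's csv module does a quote-aware split — a minimal sketch, using a hypothetical payload string of the same shape:

```python
import csv

# Hypothetical payload in the same format as the sample JSON
payload = '1,2,3,"34,35",36'

# csv.reader honors the quoting, so "34,35" stays a single field
row = next(csv.reader([payload]))
print(row)  # ['1', '2', '3', '34,35', '36']
```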

I would also like to suggest an alternative that can be faster. First create a dictionary with all the data and then dump the dictionary into a dataframe.
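A minimal sketch of that idea: accumulate the rows in plain Python first, then build the DataFrame once at the end (the payloads list here is a hypothetical stand-in for the ln['_source']['payload'] values):

```python
import pandas as pd

# Hypothetical payloads standing in for ln['_source']['payload']
payloads = ['d,e,f,g,h', 'i,j,k,l,m']

rows = [p.split(',') for p in payloads]  # accumulate in plain Python (cheap)
df = pd.DataFrame(rows)                  # build the DataFrame once
print(df)
```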

Abhishek Mishra