I have a huge JSON file (many smaller .log files in JSON format, combined to a total of 8 GB) containing several different kinds of objects, one object per row. I want to read this file into a pandas DataFrame, but I am only interested in the JSON entries for one specific object, which would drastically reduce the amount of data to read. Can this be done with pandas or plain Python before the data is read into a DataFrame?

My current code is as follows:

import pandas as pd
import glob

df = pd.concat([pd.read_json(f, encoding = "ISO-8859-1", lines=True) for f in glob.glob("logs/sample1/*.log")], ignore_index=True)

As you might imagine, this is very computationally heavy and takes a long time to complete. Is there a way to filter the data before reading it into a DataFrame?

Sample of Data:

{"Name": "1","variable": "value","X": {"nested_var": 5000,"nested_var2": 2000}}
{"Name": "2","variable": "value","X": {"nested_var": 1222,"nested_var2": 8465}}
{"Name": "2","variable": "value","X": {"nested_var": 123,"nested_var2": 865}}
{"Name": "1","variable": "value","X": {"nested_var": 5500,"nested_var2": 2070}}
{"Name": "2","variable": "value","X": {"nested_var": 985,"nested_var2": 85}}
{"Name": "2","variable": "value","X": {"nested_var": 45,"nested_var2": 77}}

I only want to read the entries where Name is "1".

David

1 Answer

You can loop over each file and each line, append the matching rows to a list, and finally pass the list to the DataFrame constructor:

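import glob
import json

import pandas as pd

data = []
for file in glob.glob('logs/*.json'):
    # ISO-8859-1 matches the encoding used in the question
    with open(file, encoding='ISO-8859-1') as f:
        for line in f:
            record = json.loads(line)  # parse each line only once
            if record['Name'] == '1':
                data.append(record)

df = pd.DataFrame(data)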
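
The same loop can also be wrapped in a generator, which may be easier to reuse across log folders. This is only a sketch under a few assumptions: the function name `iter_matching_records` is made up for illustration, the glob pattern is the one from the question, and blank or malformed lines are assumed to be safe to skip.

```python
import glob
import json


def iter_matching_records(pattern, key="Name", value="1", encoding="ISO-8859-1"):
    """Yield parsed JSON objects whose `key` equals `value`, one line at a time."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding=encoding) as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines between concatenated logs
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # assumption: malformed lines can be ignored
                if isinstance(record, dict) and record.get(key) == value:
                    yield record


# Path taken from the question; adjust it to your own layout.
records = list(iter_matching_records("logs/sample1/*.log"))
print(len(records), "matching records")
```

The resulting `records` list can be passed to `pd.DataFrame(records)` exactly as in the loop above. Only the matching rows are kept in memory, so the peak memory use should depend on the filtered data rather than on the full 8 GB.
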
jezrael
  • Thanks. Does this make any difference in execution speed compared to my method? – David Sep 14 '18 at 12:43
  • @DavidFarrugia - That is a really hard question; it depends on the data, so it is best to test it. – jezrael Sep 14 '18 at 12:44
  • True. I will test this. I will mark this as the answer since it does what I requested. Thanks. – David Sep 14 '18 at 12:47
  • @jezrael how to write solution using `np.repeat(df[['col1','col2']],repeats=df['col3'].str.len())` for this [question](https://stackoverflow.com/questions/52352226/repeating-the-rows-of-a-data-frame) – Pyd Sep 16 '18 at 08:45