
I need to convert a lot of large files from JSON to CSV format. I did it this way before:

import csv
import json

with open('test.json') as f:
    data = json.load(f)          # loads the entire file into memory

with open('test.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for x in data:
        writer.writerow([x["host"], x["port"]])

It works if the format is correctly maintained, but now I have the following problems:

  1. A single file can be about 500 gigabytes.

Is there any way to optimize my code and make it faster?

  2. The main problem is that this fairly large file contains JSON that is not quite in the correct format, i.e.
{"data":"abc","host":"12.34.56.78","path":"/","port":123}
{"data":"abc","host":"12.34.56.78","path":"/","port":123}
{"data":"abc","host":"12.34.56.78","path":"/","port":123}
...

As you can see, the [ at the beginning of the file, the ] at the end, and the "," at the end of each line are missing; these are what currently prevent the file from being converted correctly.

I know that I could first open each file and add a "," at the end of every line, plus the [ and ] around the whole thing, but those are extra passes over a very large file, and with many such files this can hurt performance. Is it somehow possible to do this in my case without losing efficiency and with minimal passes over the file?

Maybe someone can suggest the most effective solution?

KarlsD
  • You don't want to encapsulate it in a list - this is [jsonlines](http://jsonlines.org/), which is specifically designed to allow you to iterate over rows without loading the full JSON structure into memory – roganjosh Jul 29 '20 at 18:13
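
A minimal sketch of the streaming approach that comment describes, assuming each line is a standalone JSON object (filenames reused from the question's example):

import csv
import json

# read one JSON object per line and write the CSV row immediately,
# so the whole file is never held in memory
with open('test.json', encoding='utf-8') as src, \
     open('test.csv', 'w', newline='', encoding='utf-8') as dst:
    writer = csv.writer(dst)
    for line in src:
        line = line.strip()
        if not line:                       # skip blank lines
            continue
        record = json.loads(line)          # parse a single object
        writer.writerow([record["host"], record["port"]])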

1 Answer


To increase performance you should consider the special format of your input:

  1. It appears to be a JSON string per-line, not per-file
  2. The lines have a simple, flat structure

You might increase your performance in two ways:

  1. Load the file contents and write the corresponding CSV line-by-line
  2. Do not use a JSON parser; instead, use a regex replacement per line, with capture groups for the respective data fields. This way you get the desired output format in one go

Here is proof-of-concept code using memory mapping (though it is always good to benchmark for yourself whether mmap is indeed faster):

import re
import mmap
import sys

# full-line variant: spell out the complete line format (fill in the remaining fields)
matcher = re.compile(r'^{"data":"(?P<data>\w*)","host":"(?P<host>\d+\.\d+\.\d+\.\d+)" ... }$')
# alternative: only include the parts that you are interested in (e.g. host, port)
# in the matcher and ignore the format of the rest
matcher = re.compile(r'.*"host":"(?P<host>\d+\.\d+\.\d+\.\d+)".*"port":(?P<port>\d+).*')

with open(sys.argv[1], 'rb') as f:  # input path passed as the first command-line argument
    buffer = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for line in iter(buffer.readline, b''):
        line = matcher.sub(r'\g<host>,\g<port>', line.decode('unicode_escape').rstrip('\n'))
        print(line)
    buffer.close()
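
If you want the rows written to a CSV file rather than printed, a variant along the same lines (a sketch; 'out.csv' is just a placeholder name, and lines that do not match are simply skipped):

import re
import mmap
import csv
import sys

matcher = re.compile(r'"host":"(?P<host>\d+\.\d+\.\d+\.\d+)".*"port":(?P<port>\d+)')

with open(sys.argv[1], 'rb') as f, \
     open('out.csv', 'w', newline='', encoding='utf-8') as dst:
    writer = csv.writer(dst)
    buffer = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for line in iter(buffer.readline, b''):
        m = matcher.search(line.decode('unicode_escape'))
        if m:                              # skip malformed lines
            writer.writerow([m.group('host'), m.group('port')])
    buffer.close()
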
ypnos
  • thank you for helping me out; you probably meant: `line = matcher.sub(r'\g<host>,\g<port>', line.decode('unicode_escape'))` right? – KarlsD Jul 29 '20 at 21:18
  • Is it more efficient to work with one large file (e.g. 500 gigabytes), or is it worth splitting it up and working with smaller files? – KarlsD Jul 30 '20 at 09:54
  • This depends mostly on the operating system and your general setup, filesystem etc. – ypnos Jul 30 '20 at 10:41