
I have two files. file_1.csv has several lines, and each line contains several numbers, like:

1,2,3
4,5,6


7
8,9
...

while file_2.csv contains the count of numbers on each line of file_1.csv, like:

3
3
0
0
1
2
...

I want to use the data with MongoDB, so I am converting the data from CSV to JSON. Each JSON document should have three key-value pairs: the first key is the line index, the second key holds the numbers from file_1.csv, and the third key holds the corresponding count from file_2.csv, like this:

{
    "id": 1,
    "numbers": [1,2,3],
    "count": 3
}
...

Of course I can do it with open() to read both files into arrays and then build the JSON myself, but since my data is quite large I am wondering whether there is a faster way. Besides, I am not quite sure how to turn an integer array into a JSON value in Python.
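To be concrete, this is roughly the plain with open() version I have in mind (the file names and the one-document-per-line output file are just placeholders, not requirements); it is a faster alternative to this that I am after:

import json

with open("file_1.csv") as f1, open("file_2.csv") as f2, open("out.json", "w") as out:
    # start at 1 so "id" matches the example above
    for idx, (numbers_line, count_line) in enumerate(zip(f1, f2), start=1):
        # an empty line in file_1.csv becomes an empty list
        numbers = [int(x) for x in numbers_line.strip().split(",") if x]
        doc = {
            "id": idx,
            "numbers": numbers,  # json.dumps serializes a Python list of ints as a JSON array
            "count": int(count_line.strip()),
        }
        out.write(json.dumps(doc) + "\n")  # one JSON document per line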

Many many thanks!

  • How large of a file? How often are you going to have to run the script? – jmunsch Jul 16 '20 at 15:31
  • like several hundred GB for about 60000 files in total, just once. I can actually use multiprocessing. – Erwin Zangwill Jul 16 '20 at 15:32
  • 1
    Wouldn't it be easier to skip the second file? Since you are already reading the first one, and you could know the size – kachus22 Jul 16 '20 at 15:36
  • yeah, you are right, but I just thought that would be faster? – Erwin Zangwill Jul 16 '20 at 15:39
  • @ErwinZangwill which version of MongoDB? might be able to use `mongoimport` also see: https://stackoverflow.com/questions/8717179/chunking-data-from-a-large-file-for-multiprocessing – jmunsch Jul 16 '20 at 15:39
  • @jmunsch I think the link was not right? – Erwin Zangwill Jul 16 '20 at 15:43
  • the first link was for chunking large files and sending to python multiprocessing. in regards to using `mongoimport` this looks related: https://stackoverflow.com/questions/44622394/import-csv-data-as-array-in-mongodb-using-mongoimport – jmunsch Jul 16 '20 at 15:45
  • I would guess that transforming the files, and then using `mongoimport` would be faster, however, you still might face issues related to recovering from a failed import, and rolling the changes back, duplicate data, and keeping track of what's been imported already, etc. ( Ensuring a long running import is idempotent when re-run ) Which Operating System? – jmunsch Jul 16 '20 at 15:50
  • Also, if the data is being sent over a network, the bottleneck might be in sending the data. – jmunsch Jul 16 '20 at 15:53
  • @jmunsch redhat is the os. By "transforming the files" do you mean transforming the csv to json file? – Erwin Zangwill Jul 17 '20 at 11:09
  • @ErwinZangwill no, I mean transforming the csv files to be properly formatted csv for use with `mongoimport` – jmunsch Jul 17 '20 at 16:25

0 Answers