
I have two files. file_1.csv has several lines, and each line contains several numbers, like:

1,2,3
4,5,6


7
8,9
...

while file_2.csv contains the count of numbers on each line of file_1.csv, like:

3
3
0
0
1
2
...

I want to use the data with MongoDB, so I am converting the data from CSV to JSON. Each JSON document should have three key-value pairs: the first key is the line index, the second key holds the numbers from file_1.csv, and the third key holds the corresponding count from file_2.csv, like this:

{
    "id": 1,
    "numbers": [1,2,3],
    "count": 3
}
...

Of course I can do it with open() to read both files into arrays and then build the JSON myself, but since my data is quite large I am wondering whether there is a faster way. Besides, I am not quite sure how to turn an integer array into a JSON value in Python.
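To be concrete, this is roughly the plain with open() version I have in mind (the file names and the one-document-per-line output file are just placeholders, not requirements); it is a faster alternative to this that I am after:

import json

with open("file_1.csv") as f1, open("file_2.csv") as f2, open("out.json", "w") as out:
    # start at 1 so "id" matches the example above
    for idx, (numbers_line, count_line) in enumerate(zip(f1, f2), start=1):
        # an empty line in file_1.csv becomes an empty list
        numbers = [int(x) for x in numbers_line.strip().split(",") if x]
        doc = {
            "id": idx,
            "numbers": numbers,  # json.dumps serializes a Python list of ints as a JSON array
            "count": int(count_line.strip()),
        }
        out.write(json.dumps(doc) + "\n")  # one JSON document per line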

Many many thanks!

  • How large of a file? How often are you going to have to run the script? – jmunsch Jul 16 '20 at 15:31
  • like several hundred GB for about 60000 files in total, just once. I can actually use multiprocessing. – Erwin Zangwill Jul 16 '20 at 15:32
  • 1
    Wouldn't it be easier to skip the second file? Since you are already reading the first one, and you could know the size – kachus22 Jul 16 '20 at 15:36
  • yeah, you are right, but I just thought that would be faster? – Erwin Zangwill Jul 16 '20 at 15:39
  • @ErwinZangwill which version of MongoDB? might be able to use `mongoimport` also see: https://stackoverflow.com/questions/8717179/chunking-data-from-a-large-file-for-multiprocessing – jmunsch Jul 16 '20 at 15:39
  • @jmunsch I think the link was not right? – Erwin Zangwill Jul 16 '20 at 15:43
  • the first link was for chunking large files and sending to python multiprocessing. in regards to using `mongoimport` this looks related: https://stackoverflow.com/questions/44622394/import-csv-data-as-array-in-mongodb-using-mongoimport – jmunsch Jul 16 '20 at 15:45
  • I would guess that transforming the files, and then using `mongoimport` would be faster, however, you still might face issues related to recovering from a failed import, and rolling the changes back, duplicate data, and keeping track of what's been imported already, etc. ( Ensuring a long running import is idempotent when re-run ) Which Operating System? – jmunsch Jul 16 '20 at 15:50
  • Also, if the data is being sent over a network, the bottleneck might be in sending the data. – jmunsch Jul 16 '20 at 15:53
  • @jmunsch redhat is the os. By "transforming the files" do you mean transforming the csv to json file? – Erwin Zangwill Jul 17 '20 at 11:09
  • @ErwinZangwill no, I mean transforming the csv files to be properly formatted csv for use with `mongoimport` – jmunsch Jul 17 '20 at 16:25

0 Answers