
I have a huge text file (>1 GB) that I want to parse and convert into smaller files.

My text file looks like this:

Iteration  column1     column2     ....   column 10k
1          data_1_1    data_1_2           data_1_10k
2          data_2_1    data_2_2           data_2_10k
...
10k        data_10k_1  data_10k_2         data_10k_10k

I want to parse this text file and convert it into 10k CSV files where each CSV file will contain the following data:

Iteration,   column
1,           data_1
2,           data_2
...,
10k,         data_10k

I'm looking for the fastest method to do this in Python. Would it be possible to parallelize this into 10k chunks?

lonely
  • Does this answer your question? [How to read / process large files in parallel with Python](https://stackoverflow.com/questions/50636059/how-to-read-process-large-files-in-parallel-with-python) – Philip Jun 19 '20 at 08:39
  • Please use `GB` for gigabyte.... https://en.m.wikipedia.org/wiki/Gigabyte – Mark Setchell Jun 19 '20 at 08:41

1 Answer


I think that if your file is "well formatted" you can easily use numpy functions to load chunks of the text file. With np.loadtxt() you can set the number of rows to skip and the number of rows to read. That way you can set up a simple for loop, read the file in chunks, and write each chunk out to other files.
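For example, a rough sketch of that loop (assuming whitespace-separated values, single-token column names, a known number of data rows, and placeholder file names and chunk size):

```python
import numpy as np

INPUT = "big_file.txt"   # placeholder name for the big text file
CHUNK_ROWS = 1_000       # illustrative chunk size
TOTAL_ROWS = 10_000      # data rows in the big file (assumed known here)

# Column names come from the header line.
with open(INPUT) as f:
    header = f.readline().split()

for start in range(0, TOTAL_ROWS, CHUNK_ROWS):
    # Skip the header plus the rows already handled, then read one chunk.
    chunk = np.loadtxt(INPUT, dtype=str, ndmin=2,
                       skiprows=1 + start, max_rows=CHUNK_ROWS)
    # Append each data column of this chunk to its own CSV file.
    for j, name in enumerate(header[1:], start=1):
        with open(f"{name}.csv", "a") as out:
            if start == 0:
                out.write(f"Iteration,{name}\n")
            for row in chunk:
                out.write(f"{row[0]},{row[j]}\n")
```

Keep in mind that np.loadtxt() still has to scan past all the skipped rows on every call, and re-opening every per-column CSV for each chunk is not free, so this trades raw speed for simplicity.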

If you wish to use multiprocessing, you have to write a function that reads a chunk of the text file and saves it. Then, using the pool.map() or pool.apply_async() methods, you can iterate through the file in a similar way as above, but with the multiprocessing module.
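For example, one way to apply the pool.map() idea (a variation on the above where each task handles a single column via usecols, so the workers never write to the same output file; the function and file names are just placeholders):

```python
import numpy as np
from multiprocessing import Pool

INPUT = "big_file.txt"   # placeholder name for the big text file

def save_column(args):
    """Read the Iteration column plus one data column and write one CSV."""
    col_index, col_name = args
    # usecols limits the parse to the two columns this task needs.
    data = np.loadtxt(INPUT, dtype=str, skiprows=1, usecols=(0, col_index))
    np.savetxt(f"{col_name}.csv", data, fmt="%s", delimiter=",",
               header=f"Iteration,{col_name}", comments="")

if __name__ == "__main__":
    with open(INPUT) as f:
        header = f.readline().split()
    # One task per data column; header[0] is "Iteration".
    tasks = [(i, name) for i, name in enumerate(header) if i > 0]
    with Pool() as pool:
        pool.map(save_column, tasks)
```

The catch is that every task re-parses the whole input file, so with 10k columns you would probably want each task to handle a group of columns rather than a single one.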

Edo98