0

I am trying to figure out a way with the pandas library to take a 3 column csv file and turn it into a 5 column csv file with formatting in place for the last column. I Then need to save the output. I was able to write this in powershell but I cant figure it out in python and I need it to be extraordinary efficient. I am using a 3 column 140 million row file with this. Example of what im trying to do:

Test-data 1234567 123456789

Test-data 1234567 123456789

To ->

Test-data 1234567 123456789 123-45-6789 123/45/6789

Test-data 1234567 1234667890 123-45-6789 123/45/6789

Thanks in advance for any help.

1 Answers1

1

Here is an example solution

import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv("sample_data.csv", chunksize=chunksize, engine="c", header=None, sep=" ", dtype="string"):
    chunk[3] = chunk[2].map(lambda x: x[:3] + "-" + x[3:5] + "-" + x[5:])
    chunk[4] = chunk[2].map(lambda x: x[:3] + "/" + x[3:5] + "/" + x[5:])
    print(chunk.head(10))

    chunk.to_csv("sample_output.csv", header=None, mode='a')

Where I have used the following as points of reference:

Writing large Pandas Dataframes to CSV file in chunks

How do I read a large csv file with pandas?

bmbigbang
  • 1,318
  • 1
  • 10
  • 15