
I have some very large CSV files (15+ GB) that contain 4 initial rows of metadata / header info and then the data. The first 3 columns are 3D Cartesian coordinates and are the values I need to change with basic maths operations, e.g. add, subtract, multiply, divide. I need to do this en masse to each of the coordinate columns. The first 3 columns are float type values.

The rest of the columns in the CSV could be of any type, e.g. string, int, etc.

I currently use a script where I read in each row of the CSV, make the modification, then write to a new file, and it seems to work fine. But the problem is it takes days on a large file. The machine I'm running on has plenty of memory (120 GB), but my current method doesn't utilise that.

I know I can update a column en masse using a numpy 2D array if I skip the 4 metadata rows, e.g.:

arr = np.genfromtxt(input_file_path, delimiter=',', skip_header=4)
arr[:,0]=np.add(arr[:,0],300)

This will update the first column by adding 300 to each value. But the issues I have with trying to use numpy are:

  1. Numpy arrays don't support mixed data types for the rest of the columns that will be imported (I don't know what the other columns will hold, so I can't use structured arrays - or rather, I want it to be a universal tool, so I don't have to know what they will hold).

  2. I can export the numpy array to CSV (provided it's not mixed types), and using regular text functions I can create a separate CSV for the 4 rows of metadata, but then I somehow need to concatenate them, and I don't want to have to read through all the lines of the data CSV just to append it to the bottom of the metadata CSV (see the sketch below).
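Thinking about point 2 a bit more, I believe np.savetxt will accept an already-open file handle, so I could write the 4 metadata rows and then append the array to the same handle without re-reading anything - something like the untested sketch below. But it still falls over on point 1, because every column would have to be numeric:

import numpy as np


def shift_coords_numpy(input_file_path, output_file_path, dx, dy, dz):
    # keep the 4 metadata rows as raw text so they can be written back verbatim
    with open(input_file_path, newline='') as f:
        meta_lines = [next(f) for _ in range(4)]

    # only works if *all* columns are numeric
    arr = np.genfromtxt(input_file_path, delimiter=',', skip_header=4)
    arr[:, 0] += dx
    arr[:, 1] += dy
    arr[:, 2] += dz

    # write the metadata first, then hand the same open handle to savetxt
    with open(output_file_path, 'w', newline='') as f:
        f.writelines(meta_lines)
        np.savetxt(f, arr, delimiter=',', fmt='%.3f')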

I know that if I can make this work with numpy it will greatly increase the speed by utilising the machine's large amount of memory and holding the entire CSV in memory while I do the operations. I've never used pandas, but I would also consider it for a solution. I've had a bit of a look into pandas, thinking I may be able to do it with dataframes, but I still need to figure out how to have 4 rows as my column header instead of one. Additionally, I haven't seen a way to apply a mass update to a whole column (like I can with numpy) without using a Python loop - I'm not sure whether that would be slow if everything is already in memory.

[Image of potential data]

The metadata can be empty for rows 2, 3 and 4, but in most cases row 4 will have the data type recorded. There could be up to 200 data columns in addition to the initial 3 coordinate columns.

My current (slow) code looks like this:

import os
import subprocess
import csv
import numpy as np


def move_txt_coords_to(move_by_coords, input_file_path, output_file_path):

    # create new empty output file
    open(output_file_path, 'a').close()

    with open(input_file_path, newline='') as f:
        reader = csv.reader(f)
        for idx, row in enumerate(reader):
            if idx < 4:
                append_row(output_file_path, row)
            else:
                new_x = round(float(row[0]) + move_by_coords['x'], 3)
                new_y = round(float(row[1]) + move_by_coords['y'], 3)
                new_z = round(float(row[2]) + move_by_coords['z'], 3)
                row[0] = new_x
                row[1] = new_y
                row[2] = new_z
                append_row(output_file_path, row)


def append_row(output_file, row):
    f = open(output_file, 'a', newline='')
    writer = csv.writer(f, delimiter=',')
    writer.writerow(row)
    f.close()


if __name__ == '__main__':
    move_by_coords = {
        'x': -338802.5,
        'y': -1714752.5,
        'z': 0
    }

    input_file_path = r'D:\incoming_data\large_data_set1.csv'
    output_file_path = r'D:\outgoing_data\large_data_set_relocated.csv'
    move_txt_coords_to(move_by_coords, input_file_path, output_file_path)
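(For reference, I realise that reopening the output file for every row in append_row adds a lot of overhead. A version that keeps both files open the whole time would look like the sketch below, but it still processes one row at a time in Python, which is what I'm hoping to avoid by holding everything in memory.)

def move_txt_coords_buffered(move_by_coords, input_file_path, output_file_path):
    # same logic as above, but both files stay open for the whole run
    with open(input_file_path, newline='') as fin, \
         open(output_file_path, 'w', newline='') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout, delimiter=',')
        for idx, row in enumerate(reader):
            if idx >= 4:
                row[0] = round(float(row[0]) + move_by_coords['x'], 3)
                row[1] = round(float(row[1]) + move_by_coords['y'], 3)
                row[2] = round(float(row[2]) + move_by_coords['z'], 3)
            writer.writerow(row)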
  • Have you considered using [pandas](https://pandas.pydata.org/) dataframes to read/load the csv and then apply the operations only on the desired columns as if they were numpy arrays? Note that pandas won't be efficient if you load all the data at once, so use a batch size! Otherwise you can check out the [dask](https://dask.org/) library – FBruzzesi Mar 21 '20 at 08:55
  • Not sure why you ask a new question - everything you need can be found on SO: [skip-rows-during-csv-import-pandas](https://stackoverflow.com/questions/20637439/skip-rows-during-csv-import-pandas) and [how-to-read-a-6-gb-csv-file-with-pandas](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas), and `df[x] += delta_x` for a basic "add-offset-value" to one column - else you would need to use iloc etc. – Patrick Artner Mar 21 '20 at 08:55
  • @Patrick I don't want to skip any rows. I was just doing that so I could get it into numpy. Regardless, I've now discovered how to do it with pandas while retaining my 4 header rows. But now I'm trying to work around the issue of "Unnamed" appearing in cells that were blank. I still want those cells to be blank. `df = pd.read_csv(input_file_path, header=[0,1,2,3])` did the trick for the 4 header rows, and I was able to just do a simple `df.centroid_x += move_by_coords['x']` to update specific columns. @FBruzzesi I will need to look into batchsize, thanks for that bit of advice – Roochiedoor Mar 21 '20 at 09:08

1 Answer


Okay so I've got an almost complete answer and it was so much easier than trying to use numpy.

import pandas as pd

input_file_path = r'D:\input\large_data.csv'
output_file_path = r'D:\output\large_data_relocated.csv'

move_by_coords = {
    'x': -338802.5,
    'y': -1714752.5,
    'z': 0
}

df = pd.read_csv(input_file_path, header=[0, 1, 2, 3])
df.centroid_x += move_by_coords['x']
df.centroid_y += move_by_coords['y']
df.centroid_z += move_by_coords['z']

df.to_csv(output_file_path, sep=',')

But I have one remaining issue (possibly two). The blank cells in my header are being populated with "Unnamed". I somehow need to sub in a blank string for those in the header row.


Also, @FBruzzesi has warned me I may need to use batchsize to make it more efficient, which I'll need to check out.
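I haven't tried the chunked approach yet, but I assume it would look roughly like this (untested; the chunk size is an arbitrary guess):

chunks = pd.read_csv(input_file_path, header=[0, 1, 2, 3], chunksize=1_000_000)
for i, chunk in enumerate(chunks):
    chunk.iloc[:, 0] += move_by_coords['x']
    chunk.iloc[:, 1] += move_by_coords['y']
    chunk.iloc[:, 2] += move_by_coords['z']
    # write the header with the first chunk only, then append
    chunk.to_csv(output_file_path, sep=',', mode='w' if i == 0 else 'a', header=(i == 0))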

--------------------- Update ---------------------

Okay, I resolved the multiline header issue. I just use the regular csv reader module to read the first 4 rows into a list of rows, then transpose this into a list of columns, converting each column list to a tuple at the same time. Once I have a list of column header tuples (where each tuple holds the rows within that column header), I can use the list to name the header. I therefore skip the header rows when reading the csv into the data frame, and then update each column by its index. I also drop the index column on export back to csv once done. It seems to work very well.

import csv
import itertools
import pandas as pd


def make_first_4rows_list_of_tuples(input_csv_file_path):
    # read the first 4 rows and transpose them into one tuple per column
    with open(input_csv_file_path, newline='') as f:
        reader = csv.reader(f)
        header_rows = [row for row in itertools.islice(reader, 0, 4)]

    header_col_tuples = list(map(tuple, zip(*header_rows)))
    print("Header columns: \n", header_col_tuples)
    return header_col_tuples


if __name__ == '__main__':
    move_by_coords = {
        'x': 1695381.5,
        'y': 5376792.5,
        'z': 100
    }

    input_file_path = r'D:\temp\mydata.csv'
    output_file_path = r'D:\temp\my_updated_data.csv'

    column_headers = make_first_4rows_list_of_tuples(input_file_path)
    df = pd.read_csv(input_file_path, skiprows=4, names=column_headers)
    df.iloc[:, 0] += move_by_coords['x']
    df.iloc[:, 1] += move_by_coords['y']
    df.iloc[:, 2] += move_by_coords['z']
    df.to_csv(output_file_path, sep=',', index=False)

[Image: updated and exported csv]

  • try adding `, index_col=0` to your `read_csv` statement - not sure if it helps - see [https://pandas-docs.github.io/pandas-docs-travis/reference/api/pandas.read_csv.html#pandas.read_csv](https://pandas-docs.github.io/pandas-docs-travis/reference/api/pandas.read_csv.html#pandas.read_csv) – Patrick Artner Mar 21 '20 at 09:42