How can I reshape data in a csv into a structured format?

Question

I have some .csv files that are generated from Computational Fluid Dynamics simulations. They contain the values of the velocity, pressure, density, etc. at given points in space. For each point, its coordinates and the values of the fields at that point are printed on a row in the csv file. For a 2D grid with x values of 1,2,3 and y values of 4,5,6, the data is arranged in the following way:


X Y (field variables)
1 4         :
2 4         :
3 4         :
1 5         :
2 5         :
3 5         :
1 6         :
2 6         :
3 6         :

We start with the lowest y value, cycle through all the x values, then go to the next y value and repeat.

What I would like to do is put this data into a structured format. That is, I would like to put the data into an xarray dataset that uses the x and y values as coordinate axes, or put the values into a numpy ndarray of the proper shape (in this case, 3x3). I could load the file into a Pandas dataframe and then restructure the data manually using for loops, but this is extremely slow for even moderately large data files. I would like a faster way that uses built-in functions from the pandas, numpy, and xarray libraries.

Anybody have any ideas?

You can apply a pivot like `df.pivot_table(index="X", columns="Y")`. SO dupe: https://stackoverflow.com/questions/47152691/how-can-i-pivot-a-dataframe , pandas guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping , docs for df.pivot_table: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html — Chrysophylaxs, Jan 29 '23 at 23:02
@Chrysophylaxs Can I easily reprocess it into an xarray dataset? That would be ideal. — K.defaoite, Jan 30 '23 at 11:17
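
A minimal sketch of the pivot suggested in the comment above, using a made-up 3x3 sample in the layout from the question. The column names `X`, `Y`, and `rho` are assumptions about the file header. Note that `pivot_table` averages any duplicate (X, Y) pairs by default, which may or may not be what you want.

```python
from io import StringIO

import pandas as pd

# Made-up sample in the layout described in the question.
csv_text = """X,Y,rho
1,4,0.503
2,4,0.642
3,4,0.041
1,5,0.340
2,5,0.269
3,5,0.288
1,6,0.511
2,6,0.732
3,6,0.195
"""

df = pd.read_csv(StringIO(csv_text))

# One row per X value, one column per Y value.
rho_table = df.pivot_table(index="X", columns="Y", values="rho")

# The underlying 2D NumPy array, with shape (n_x, n_y).
rho_array = rho_table.to_numpy()

print(rho_table)
```

The row and column labels of `rho_table` hold the sorted X and Y values, so the coordinate axes are not lost when converting to a plain array.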

score 1 · Accepted Answer · answered Jan 30 '23 at 21:32

Here's an example that uses Xarray and Pandas together:

from io import StringIO

import pandas as pd
import xarray as xr

s = StringIO("""x,y,rho
1,4,0.503
2,4,0.642
3,4,0.041
4,4,0.964
1,5,0.340
2,5,0.269
3,5,0.288
4,5,0.702
1,6,0.511
2,6,0.732
3,6,0.195
4,6,0.226
1,7,0.957
2,7,0.032
3,7,0.304
4,7,0.607
""")

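# open csv, create a MultiIndex from x/y columns
df = pd.read_csv(s, index_col=['x', 'y'])

# convert to Xarray Dataset; to_xarray() unstacks the (x, y) MultiIndex
# into separate x and y dimensions, so the result is already on a 2D grid
ds = df.to_xarray()

# The two steps below are not needed and are left commented out:
# the assignment is reported to raise an error, and unstack() has
# nothing left to reshape.
# ds['points'] = df.index
# ds = ds.unstack()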

This will produce an Xarray Dataset that looks like:

<xarray.Dataset>
Dimensions:  (x: 4, y: 4)
Coordinates:
  * x        (x) int64 1 2 3 4
  * y        (y) int64 4 5 6 7
Data variables:
    rho      (x, y) float64 0.503 0.34 0.511 0.957 ... 0.964 0.702 0.226 0.607
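
Since the question also asks for plain NumPy arrays, here is a small sketch of how the resulting Dataset might be used with more than one field. The `p` column is a hypothetical second field added for illustration, and the sample values are made up.

```python
from io import StringIO

import pandas as pd

# Made-up sample; "p" is a hypothetical second field column.
csv_text = """x,y,rho,p
1,4,0.503,101.0
2,4,0.642,102.0
1,5,0.340,103.0
2,5,0.269,104.0
"""

df = pd.read_csv(StringIO(csv_text), index_col=["x", "y"])
ds = df.to_xarray()  # requires xarray to be installed

# Each CSV column becomes a data variable on the same (x, y) grid.
rho = ds["rho"].values                     # NumPy array, shape (n_x, n_y)
p_yx = ds["p"].transpose("y", "x").values  # same data with y as the first axis

# Label-based lookup by coordinate value rather than by position.
rho_at_point = float(ds["rho"].sel(x=2, y=5))
```

Here `.values` gives the ndarray the question asks for, while `.sel` keeps the link between the data and the physical coordinates.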

Wow, really nice. I will give this a try. If it works I will accept this answer. — K.defaoite, Feb 01 '23 at 20:43
The second to last line threw an error and the last line doesn't appear to do anything, but it's a working solution anyway, so thanks! — K.defaoite, Feb 02 '23 at 13:31

score 0 · Answer 2 · answered Jan 29 '23 at 23:50

Fellow CFD researcher here. I believe this can be accomplished manually with a single for loop that iterates only over your state variable list (i.e. rho).

# https://stackoverflow.com/questions/75278985/how-can-i-reshape-data-in-a-csv-into-a-structured-format
import time
# the timer starts before the remaining imports, so the reported time includes them
start = time.time()

import numpy as np
import pandas as pd

df = pd.read_csv('test_data.csv')

# Assumes x and y are consecutive integers, so each coordinate
# minus its minimum can be used directly as an array index.
min_x_coor = df['x'].min()
min_y_coor = df['y'].min()

x_dim = df['x'].max() - min_x_coor + 1
y_dim = df['y'].max() - min_y_coor + 1

rho_array = np.zeros((x_dim, y_dim))
# single pass over the rows of the file
for p in range(len(df)):
    x_coor = df['x'][p] - min_x_coor
    y_coor = df['y'][p] - min_y_coor

    rho_array[x_coor, y_coor] = df['rho'][p]

print(rho_array)
print(time.time() - start)

For 3x3 data:

x,y,rho
1,4,0.503
2,4,0.642
3,4,0.041
1,5,0.340
2,5,0.269
3,5,0.288
1,6,0.511
2,6,0.732
3,6,0.195

Output:

[[0.503 0.34  0.511]
 [0.642 0.269 0.732]
 [0.041 0.288 0.195]]
0.31889796257019043

For 4x4 data:

x,y,rho
1,4,0.503
2,4,0.642
3,4,0.041
4,4,0.964
1,5,0.340
2,5,0.269
3,5,0.288
4,5,0.702
1,6,0.511
2,6,0.732
3,6,0.195
4,6,0.226
1,7,0.957
2,7,0.032
3,7,0.304
4,7,0.607

Output:

[[0.503 0.34  0.511 0.957]
 [0.642 0.269 0.732 0.032]
 [0.041 0.288 0.195 0.304]
 [0.964 0.702 0.226 0.607]]
0.48914408683776855
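
For comparison, a sketch of the same fill done without a Python-level loop. It swaps the min/max arithmetic for `np.unique(..., return_inverse=True)`, so it does not assume integer or evenly spaced coordinates. The 3x3 sample is the one shown above. For floating-point coordinates, values that should be equal may need rounding first, since `np.unique` compares exactly.

```python
from io import StringIO

import numpy as np
import pandas as pd

csv_text = """x,y,rho
1,4,0.503
2,4,0.642
3,4,0.041
1,5,0.340
2,5,0.269
3,5,0.288
1,6,0.511
2,6,0.732
3,6,0.195
"""
df = pd.read_csv(StringIO(csv_text))

# Sorted unique coordinates, plus each row's position along each axis.
x_vals, x_idx = np.unique(df["x"].to_numpy(), return_inverse=True)
y_vals, y_idx = np.unique(df["y"].to_numpy(), return_inverse=True)

# NaN marks any grid point that is missing from the file.
rho_array = np.full((x_vals.size, y_vals.size), np.nan)

# One vectorized assignment replaces the row-by-row loop.
rho_array[x_idx, y_idx] = df["rho"].to_numpy()

print(rho_array)
```

The arrays `x_vals` and `y_vals` double as the coordinate axes for the grid.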

Thanks, but I explicitly stated in my post that I want to *avoid* for loops like `for p in range(0, len(df['rho'])):...` because they are very slow. — K.defaoite, Jan 30 '23 at 10:35
Or rather, I'm sorry. Because this is only a single loop over the rows and not nested loops, it probably isn't too bad. — K.defaoite, Jan 30 '23 at 10:37
Your definitions of x_dim and y_dim do not make any sense however. — K.defaoite, Jan 30 '23 at 11:19
@K.defaoite, `x_dim` and `y_dim` are the initialized dimensions for the `rho_array`. Hope that clears it up. Additionally, this code can very easily be modified to include an additional data column, i.e. `v`. You'd just need to include `v_array = np.zeros((x_dim, y_dim))`, and within the loop, `v_array[x_coor][y_coor] = df['v'][p]`. — Jacob Ivanov, Jan 30 '23 at 15:07
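
If the rows are guaranteed to follow the ordering described in the question (x cycles fastest, y slowest) and no grid points are missing, several field columns can be handled at once with a plain reshape. This is only a sketch under those assumptions. The `v` column and the sample values are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Made-up 3x2 sample; "v" is a hypothetical second field column.
csv_text = """x,y,rho,v
1,4,0.503,1.0
2,4,0.642,2.0
3,4,0.041,3.0
1,5,0.340,4.0
2,5,0.269,5.0
3,5,0.288,6.0
"""
df = pd.read_csv(StringIO(csv_text))

n_x = df["x"].nunique()
n_y = df["y"].nunique()

# Guard against missing or duplicated grid points before reshaping.
if len(df) != n_x * n_y:
    raise ValueError("file does not describe a complete n_x by n_y grid")

# Rows cycle through x for each y, so a C-order reshape gives (n_y, n_x);
# transposing yields arrays indexed as [x, y].
fields = {
    name: df[name].to_numpy().reshape(n_y, n_x).T
    for name in ["rho", "v"]
}

print(fields["rho"])
```

Unlike the index-based approaches, this relies entirely on the row order, so it is worth checking that order on a small file first.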
