
I have a nested numpy.ndarray with the following structure (all sublists at a given level have the same size):

len(exp_data) # Timepoints
Out[205]: 42

len(exp_data[0])
Out[206]: 1

len(exp_data[0][0]) # Y_bins
Out[207]: 13

len(exp_data[0][0][0]) # X_bins
Out[208]: 43

type(exp_data[0][0][0][0])
Out[209]: numpy.float64

I want to move these into a pandas DataFrame with three index columns (each numbered from 0 to N for its dimension) and a last column holding the float value. I could do this with a series of nested loops, but that seems like a very inefficient way of solving the problem.

In addition I would like to get rid of any nan values (not present in sample data). Do I do this after creating the df or is there a way to skip adding them in the first place?

NOTE: code below has been edited and I've added sample data

import random
import numpy as np
import pandas as pd

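# Note: "* 5" repeats references to the same nested lists, so all 5 timepoints share their data
exp_data = [[[[random.random() for x in range(5)],
              [random.random() for x in range(5)],
              [random.random() for x in range(5)],
              ]]] * 5
exp_data[0][0][0][1] = np.nan  # because the lists are shared, this puts a nan in every timepoint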

df = pd.DataFrame(columns = ['Timepoint','Y_bin','X_bin','Values'])

for t,timepoint in enumerate(exp_data):
    for y,y_bin in enumerate(timepoint[0]):
        for x,x_bin in enumerate(y_bin):
            df.loc[len(df)] = [int(t),int(y),int(x),x_bin]

df = df.dropna().reset_index(drop=True)

The final format should be as follows (except I'd preferably like integers instead of floats in first 3 columns, but not essential; int(t) etc. doesn't do the trick)

df
Out[291]: 
    Timepoint  Y_bin  X_bin    Values
0         0.0    0.0    0.0  0.095391
1         0.0    0.0    2.0  0.963608
2         0.0    0.0    3.0  0.855735
3         0.0    0.0    4.0  0.392637
4         0.0    1.0    0.0  0.555199
5         0.0    1.0    1.0  0.118981
6         0.0    1.0    2.0  0.201782
...

len(df) # has received a total of 75 (5*3*5) input values of which 5 are nan
Out[293]: 70
db_
  • You might get some responses if you provide the input data in the form of an MCVE. https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Rich Andrews Apr 03 '19 at 14:48
  • Please add example data so we can run your code and modify it. – furas Apr 03 '19 at 15:34
  • Thanks for the suggestion, I have now added sample data (shorter form than original and in list form rather than ndarray, but I doubt that matters) – db_ Apr 03 '19 at 17:35

1 Answer


You can change the display format of the float output by adding this line of code:

pd.options.display.float_format = '{:,.0f}'.format

Add it to the end of your code, like this:

df = pd.DataFrame(columns=['Timepoint', 'Y_bin', 'X_bin', 'Values'])
for t, timepoint in enumerate(exp_data):
    for y, y_bin in enumerate(timepoint[0]):
        for x, x_bin in enumerate(y_bin):
            df.loc[len(df)] = [t, y, x, x_bin]
df = df.dropna().reset_index(drop=True)  # assign the result; dropna() does not modify df in place

pd.options.display.float_format = '{:,.0f}'.format
df
Out[250]: 
    Timepoint  Y_bin  X_bin    Values
0          0    4      10      -2
1          0    4      11      -1
2          0    4      12      -2
3          0    4      13      -2
4          0    4      14      -2
5          0    4      15      -2
6          0    4      16      -3

...
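To check what the display option does and does not change, here is a minimal sketch on a small hypothetical frame (the three rows below are made up for illustration). It compares the stored dtypes after setting `pd.options.display.float_format` with the dtypes after an explicit `astype` conversion.

```python
import pandas as pd

# Hypothetical three-row frame whose index columns ended up as floats
df = pd.DataFrame({
    'Timepoint': [0.0, 0.0, 1.0],
    'Y_bin': [0.0, 1.0, 2.0],
    'X_bin': [3.0, 4.0, 0.0],
    'Values': [0.095391, 0.963608, 0.855735],
})
index_cols = ['Timepoint', 'Y_bin', 'X_bin']

# The display option only changes how floats are printed
pd.options.display.float_format = '{:,.0f}'.format
dtypes_after_option = df.dtypes.copy()   # still float64 everywhere
pd.reset_option('display.float_format')  # restore default printing

# An explicit conversion changes the stored dtype of the chosen columns
converted = df.astype({col: int for col in index_cols})
```

Note that the display option also rounds the printed `Values` column, which is probably not wanted here; the `astype` conversion leaves `Values` untouched.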

soben360
  • This doesn't actually change the data, only how it is displayed. – db_ Apr 04 '19 at 13:33
  • I think the correct way to do this is: `df[['Timepoint','Y_bin','X_bin']] = df[['Timepoint','Y_bin','X_bin']].astype(int)`. This still doesn't solve my main question of how to build the df in the first place without many nested loops (which make the code very slow). – db_ Apr 04 '19 at 13:46
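One possible loop-free approach, sketched under assumptions: the data can be converted to a regular 4-D array of shape (timepoints, 1, Y_bins, X_bins), for example with `np.asarray(exp_data)`. The array `arr` below is a hypothetical random stand-in, not the original data. `np.indices` produces one integer grid per axis, and flattening those grids alongside the values gives the long format directly, with integer index columns and the nan values filtered out before the frame is built.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for exp_data: (timepoints, 1, Y_bins, X_bins)
rng = np.random.default_rng(0)
arr = rng.random((5, 1, 3, 5))
arr[0, 0, 0, 1] = np.nan  # one missing value to drop

# Drop the singleton second axis so three index axes remain
cube = arr[:, 0, :, :]

# One integer grid per axis; ravel() flattens each grid in the same
# C order as cube.ravel(), so indices and values stay aligned
t_idx, y_idx, x_idx = np.indices(cube.shape)
values = cube.ravel()
keep = ~np.isnan(values)  # mask out nan before building the frame

df = pd.DataFrame({
    'Timepoint': t_idx.ravel()[keep],
    'Y_bin': y_idx.ravel()[keep],
    'X_bin': x_idx.ravel()[keep],
    'Values': values[keep],
})
```

Because the index columns come from integer arrays, no `astype(int)` step is needed afterwards, and the row order matches what the nested loops would produce.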