
I have a large .csv database with a column named VELOCITY containing 3D velocity vectors.

Each element of the VELOCITY column has the form: '(v1, v2, v3)'

To read the data I used:

import numpy as np
import pandas as pd

df = pd.read_csv('database.csv')
df = pd.DataFrame(df)

Now, I tried to define a velocity_array, where every element is a 3D velocity vector.

velocity_array = np.asarray(df['VELOCITY'])

and I get something like this:

['(a1, a2, a3)',
 '(b1, b2, b3)',
 '(c1, c2, c3)',
 .
 .
 .
 '(z1, z2, z3)']

which is not what I need. I need to have an array of 3D-velocity-arrays. Something like this:

[[a1, a2, a3],
 [b1, b2, b3],
 [c1, c2, c3],
 .
 .
 .
 [z1, z2, z3]]

so that I could recover a 3D velocity array simply by calling velocity_array[i].

Appreciate any help, thanks!

#=====

Just adding (as requested):

df.head() looks like this:

     SC_VELOCITY                        EVENT_ID
0   (-7143.645, -825.2191, -2463.361)   388161
1   (-7143.645, -825.2191, -2463.361)   400028
2   (-7087.896, -1058.8871, -2533.3374) 415847
3   (-7024.463, -1291.3812, -2600.547)  527126
4   (-6953.418, -1522.4622, -2664.9265) 605939 
Diving
    Can you print a `df.head()` and share it? – Tarifazo Aug 20 '21 at 19:16
  • Do you want to convert a tuple of strings to a list of integers? –  Aug 20 '21 at 19:32
  • Hi @Mstaino, I will add the df.head(), thanks! – Diving Aug 20 '21 at 19:41
  • Most likely the source dataframe had tuples in the column, but when saved to csv and reloaded, those became strings. The csv format is really 2d (rows and columns), so it can't directly store the extra dimension of such a column. `pandas` uses object dtype for columns with (python) strings – hpaulj Aug 20 '21 at 20:40
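The round-trip hpaulj describes can be demonstrated in a few lines (the velocity values below are made-up sample data, not taken from the real database):

```python
import io

import pandas as pd

# Build a frame whose VELOCITY column holds Python tuples
df = pd.DataFrame({'VELOCITY': [(-7143.645, -825.2191, -2463.361)],
                   'EVENT_ID': [388161]})
print(type(df['VELOCITY'].iloc[0]))   # <class 'tuple'>

# Round-trip through csv: the tuple is written out as the text "(v1, v2, v3)"
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)
print(type(reloaded['VELOCITY'].iloc[0]))   # <class 'str'>
```

That is why the reloaded column contains strings like `'(v1, v2, v3)'` rather than tuples, and why they need to be re-parsed as the answers below show.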

3 Answers


As seen from your sample data, each entry in the velocity_array is enclosed in single quotes, e.g. '(a1, a2, a3)'. Therefore, your entries are presumably strings.

If this is true, you can transform each string in the column to a list by:

df['VELOCITY'] = df['VELOCITY'].str.strip("()").str.split(r',\s*')

Result:

velocity_array = np.asarray(df['VELOCITY'])

print(velocity_array)

Output:

[list(['a1', 'a2', 'a3']) list(['b1', 'b2', 'b3'])
 list(['c1', 'c2', 'c3']) list(['z1', 'z2', 'z3'])]

Edit:

If your a1, a2 entries are actually floating-point numbers and you want them converted from strings to floats as well, you can use:

velocity_array = df['VELOCITY'].str.strip("()").str.split(r',\s*', expand=True).astype(float).to_numpy()

Result:

Based on your sample data of df.head():

print(velocity_array)

[[-7143.645   -825.2191 -2463.361 ]
 [-7143.645   -825.2191 -2463.361 ]
 [-7087.896  -1058.8871 -2533.3374]
 [-7024.463  -1291.3812 -2600.547 ]
 [-6953.418  -1522.4622 -2664.9265]]
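A self-contained sketch of that one-liner on two made-up rows (same '(v1, v2, v3)' string format as the question), showing that velocity_array[i] then behaves as the question asks:

```python
import numpy as np
import pandas as pd

# Two made-up rows in the '(v1, v2, v3)' string format
df = pd.DataFrame({'VELOCITY': ['(-7143.645, -825.2191, -2463.361)',
                                '(-7087.896, -1058.8871, -2533.3374)']})

# Strip the parentheses, split on commas (regex allows optional whitespace),
# convert each cell to float, and stack into an (n, 3) float array
velocity_array = (df['VELOCITY'].str.strip('()')
                                .str.split(r',\s*', expand=True)
                                .astype(float)
                                .to_numpy())

print(velocity_array.shape)   # (2, 3)
print(velocity_array[0])      # the first 3D velocity vector, as floats
print(velocity_array[0][0] + velocity_array[0][1])   # arithmetic now works
```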
SeaBean
  • Hi @SeaBean, thank you for your reply. I tried your suggestion, and df['VELOCITY'][0] = ['a1', 'a2', 'a3'], and df['VELOCITY'][0][0] = 'a1'; those quotes are still a problem, because I cannot operate with them. – Diving Aug 20 '21 at 20:11
  • @Diving See my edit above for the modified solution for converting the entries to numbers so that you can operate with them. Please note that the solution using `eval` has 2 issues: 1) very slow for large datasets, 2) security concerns if your input data is from outside sources. Please take caution using that solution. – SeaBean Aug 20 '21 at 20:41
  • Hi @SeaBean, thank you for the extra comment. I am learning with you guys. Thanks! – Diving Aug 21 '21 at 00:20

There are several ways to accomplish what you need. Just to add to the answers, and assuming from your output that your dataframe looks like

'(1,2,3)'
'(3,4,5)'
...

You can do:

from ast import literal_eval #python has an eval but it is not recommended due to potential security issues
df = pd.read_csv('database.csv') #no need to call pd.DataFrame, read_csv already does that
np.vstack(df.iloc[:,0].apply(literal_eval).values)

in addition to the other methods provided in the answers.
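A minimal runnable sketch of this approach on two made-up rows in the same '(v1, v2, v3)' string format:

```python
from ast import literal_eval

import numpy as np
import pandas as pd

# Two made-up rows in the '(v1, v2, v3)' string format
df = pd.DataFrame({'VELOCITY': ['(-7143.645, -825.2191, -2463.361)',
                                '(-7087.896, -1058.8871, -2533.3374)']})

# literal_eval safely parses each string into a tuple of floats;
# np.vstack then stacks the tuples into an (n, 3) float array
velocity_array = np.vstack(df['VELOCITY'].apply(literal_eval).values)

print(velocity_array.shape)                          # (2, 3)
print(velocity_array[0][0] + velocity_array[0][1])   # arithmetic now works
```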

Tarifazo
  • with vel_array = np.vstack(df_temp.iloc[:,0].apply(eval).values), we get vel_array[0] = [a1, a2, a3], with no quotes, only values. And now it is possible to operate on the elements: vel_array[0][0] + vel_array[0][1] = some_number. – Diving Aug 20 '21 at 20:18
  • Note that calling eval on data you're reading from a file [can be dangerous](https://stackoverflow.com/q/661084/3888719). If one of your arrays was `"(1, os.remove('SOMETHING IMPORTANT', recursive=True), 3)"`, then you'd accidentally lose data with this approach – Michael Delgado Aug 20 '21 at 21:01
  • @Diving if you're going to go with this answer, please consider using the `ast` package's [`ast.literal_eval`](https://docs.python.org/3/library/ast.html#ast.literal_eval) rather than `eval`, as [it is *significantly* less dangerous (though not entirely safe) than just plain old `eval`](https://stackoverflow.com/q/15197673/3888719). – Michael Delgado Aug 23 '21 at 17:27
  • @MichaelDelgado good point. I changed the answer to reflect your comment – Tarifazo Aug 23 '21 at 17:32

Check out the powerful pandas string operators.

In this case, series.str.strip and series.str.split will do the trick:

In [11]: df['velocity'].str.strip('()').str.split(', ', expand=True).to_numpy()
Out[11]:
array([['a1', 'a2', 'a3'],
       ['b1', 'b2', 'b3'],
       ['c1', 'c2', 'c3'],
       ['z1', 'z2', 'z3']], dtype=object)

If your data is actually float types, you can add .astype(float) to convert the strings to float64:

In [12]: df['velocity'].str.strip('()').str.split(', ', expand=True).to_numpy().astype(float)
Out[12]:
array([[-7143.645 ,  -825.2191, -2463.361 ],
       [-7143.645 ,  -825.2191, -2463.361 ],
       [-7087.896 , -1058.8871, -2533.3374],
       [-7024.463 , -1291.3812, -2600.547 ],
       [-6953.418 , -1522.4622, -2664.9265]])

Performance considerations

Note that the vectorized string operators are significantly faster for larger arrays compared with a row-wise operation such as ast.literal_eval.

For an array with 10,000 rows and four columns:

In [23]: s = pd.DataFrame(np.random.random(size=(10000,4))).apply(lambda x: '({},{},{},{})'.format(x[0], x[1], x[2], x[3]), axis=1)

In [24]: s
Out[24]:
0       (0.9134272324343906,0.09784434338612968,0.1064...
1       (0.6171577052744037,0.552712839936354,0.684161...
2       (0.05253084132451025,0.6216173862765718,0.3920...
3       (0.39577548909770743,0.35020447632615737,0.632...
4       (0.4761450474353096,0.20003567087846386,0.2113...
                              ...
9995    (0.3618865364493633,0.4947066480156196,0.17413...
9996    (0.4083358148057057,0.09394431583700069,0.8712...
9997    (0.9466315666988651,0.4692990331960303,0.04969...
9998    (0.22868850839996946,0.4712850069678187,0.4834...
9999    (0.1525379507879958,0.6019087151036507,0.07105...
Length: 10000, dtype: object

the pandas string operators are more than 10 times faster

In [26]: %%timeit
    ...: np.vstack(s.apply(ast.literal_eval))
    ...:
    ...:
160 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [27]: %%timeit
    ...: s.str.strip('()').str.split(', ', expand=True).to_numpy()
    ...:
    ...:
14.2 ms ± 704 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Michael Delgado
  • Hi @Michael_Delgado, thank you for your reply. I tried your suggestion: ``` vel = df['velocity'].str.strip('()').str.split(', ', expand=True).to_numpy() ``` but vel[0] = ['a1', 'a2', 'a3'], and vel[0][0] = 'a1'; those quotes are still a problem, because I cannot operate with them. – Diving Aug 20 '21 at 20:05
  • Hi @Diving - if your data was numeric, you should specify so in your question. My answer leaves the answer as a string because in your example the data is a string. To convert the result I provide to a float, just add `.astype(float)` to the command. – Michael Delgado Aug 20 '21 at 20:42
  • Great, @Michael_Delgado ! Thank you again, I am learning. Good lesson. – Diving Aug 21 '21 at 00:19