
I have a CSV file that uses two delimiters, (;) and (,). It looks like this:

vin;vorgangid;eventkm;D_8_lamsoni_w_time;D_8_lamsoni_w_value
V345578;295234545;13;-1000.0,-980.0;7.9921875,11.984375
V346670;329781064;13;-960.0,-940.0;7.9921875,11.984375

I want to import it into a pandas DataFrame, with (;) acting as the column separator and (,) as the separator for a list or array of floats within a cell. So far I am using the method below, but I am sure there is something easier out there.

aa = 0
csv_import = pd.read_csv(folder + FileName, sep=';')
for col in csv_import.columns:
    aa = aa + 1
    if type(csv_import[col][0]) == str and aa > 3:
        # string to list of strings
        csv_import[col] = csv_import[col].apply(lambda x: x.split(','))
        # turn the list of strings into a list of floats
        csv_import[col] = csv_import[col].apply(lambda x: [float(y) for y in x])

3 Answers

4

Aside from the other fine answers here, which are more pandas-specific, it should be noted that Python itself is pretty powerful when it comes to string processing. You can just place the result of replacing ';' with ',' in a StringIO object and work normally from there:

In [8]: import pandas as pd

In [9]: from cStringIO import StringIO

In [10]: pd.read_csv(StringIO(''.join(l.replace(';', ',') for l in open('stuff.csv'))))
Out[10]: 
                   vin  vorgangid  eventkm  D_8_lamsoni_w_time  \
V345578 295234545   13    -1000.0   -980.0            7.992188   
V346670 329781064   13     -960.0   -940.0            7.992188   

                   D_8_lamsoni_w_value  
V345578 295234545            11.984375  
V346670 329781064            11.984375  
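
For Python 3, as worked out in the comment thread below, only the import changes (cStringIO is a Python 2 module). A minimal sketch, assuming the file is still called stuff.csv:

import pandas as pd
from io import StringIO  # Python 3 location of StringIO

# replace ';' with ',' on the fly and feed the resulting text to read_csv
with open('stuff.csv') as fh:
    df = pd.read_csv(StringIO(''.join(l.replace(';', ',') for l in fh)))

Note that after the replacement each data row has seven comma-separated fields while the header names only five, so pandas uses the two leftmost fields as the index; that is why the columns in the output above appear shifted.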
Ami Tavory
  • Interesting approach, how does this work in Python 3.5? `import io pd.read_csv(io(''.join(l.replace(';', ',') for l in open('stuff.csv'))))` does not work – valenzio Sep 14 '16 at 12:29
  • @valenzio I just checked this on 3.5.2, and it's exactly the same. – Ami Tavory Sep 14 '16 at 12:31
  • I get the message "No module named 'cStringIO'". Do I have to install that module via pip? I had a feeling it comes with Python. – valenzio Sep 14 '16 at 12:50
  • @valenzio Sorry - bad phrasing on my part. The code is the same, but the import is different. You need to [`from io import StringIO`](http://stackoverflow.com/questions/11914472/stringio-in-python3). – Ami Tavory Sep 14 '16 at 12:52
  • Thank you, it works now, but for some reason the column names are shifted and you get a single column for each entry. Your approach is the closest to the solution I was thinking of. I will try to understand the syntax, and maybe with some manipulation I will find the solution. – valenzio Sep 14 '16 at 13:29
  • @valenzio OK, good luck. Feel free to post, if you find something interesting (by me, it seems to work, but there's always something to learn). – Ami Tavory Sep 14 '16 at 13:31
3

First, read the CSV using ; as the delimiter:

df = pd.read_csv(filename, sep=';')

UPDATE:

In [67]: num_cols = df.columns.difference(['vin','vorgangid','eventkm'])

In [68]: num_cols
Out[68]: Index(['D_8_lamsoni_w_time', 'D_8_lamsoni_w_value'], dtype='object')

In [69]: df[num_cols] = (df[num_cols].apply(lambda x: x.str.split(',', expand=True)
   ....:                                               .stack()
   ....:                                               .astype(float)
   ....:                                               .unstack()
   ....:                                               .values.tolist())
   ....:                )

In [70]: df
Out[70]:
       vin  vorgangid  eventkm D_8_lamsoni_w_time     D_8_lamsoni_w_value
0  V345578  295234545       13  [-1000.0, -980.0]  [7.9921875, 11.984375]
1  V346670  329781064       13   [-960.0, -940.0]  [7.9921875, 11.984375]

In [71]: type(df.loc[0, 'D_8_lamsoni_w_value'][0])
Out[71]: float

OLD answer:

Now we can split numbers into lists in the "number" columns:

In [20]: df[['D_8_lamsoni_w_time',  'D_8_lamsoni_w_value']] = \
    df[['D_8_lamsoni_w_time',  'D_8_lamsoni_w_value']].apply(lambda x: x.str.split(','))
In [21]: df
Out[21]:
       vin  vorgangid  eventkm D_8_lamsoni_w_time     D_8_lamsoni_w_value
0  V345578  295234545       13  [-1000.0, -980.0]  [7.9921875, 11.984375]
1  V346670  329781064       13   [-960.0, -940.0]  [7.9921875, 11.984375]
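
As the comment thread below points out, after a plain str.split each cell still holds a list of strings, not floats. A minimal follow-up conversion for the OLD approach, reusing the same two column names, could look like this:

cols = ['D_8_lamsoni_w_time', 'D_8_lamsoni_w_value']
# convert every element of each per-cell list from str to float
df[cols] = df[cols].apply(lambda c: c.map(lambda lst: [float(v) for v in lst]))

(The UPDATE above already produces real floats, as the type check in In [71] shows.)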
MaxU - stand with Ukraine
  • Thank you for the suggestion. I have 2 comments: 1) I guess line [20] can be packed into a for loop if you have way more than 2 columns, but then it becomes basically the same as my approach, no? 2) I think at the end you still end up with a list of strings, e.g. type(df.ix[0,4][0]) == 'str' – valenzio Sep 14 '16 at 11:46
  • @valenzio, please check the UPDATE section - I've completely rewritten the code and addressed both your questions – MaxU - stand with Ukraine Sep 14 '16 at 13:25
  • Thanks mate, this looks much better, but you would still need two for loops if you have, say, 100 entries for lines 67 and 68. I was looking for a solution similar to what Ami Tavory suggested: basically one import command with the right parameters. – valenzio Sep 14 '16 at 13:35
  • @valenzio, no, no additional loops are needed – MaxU - stand with Ukraine Sep 14 '16 at 13:37
  • @MaxU, well, think about it this way: the data I provided is just a small portion, to make it more readable. In reality I have over 100 columns with different names, so I would need one for loop to create 'num_cols' and one to assign the different columns with the new data type. – valenzio Sep 14 '16 at 14:16
  • @valenzio, that's the point! `df.columns.difference(['vin','vorgangid','eventkm'])` - will give you __ALL__ columns except `['vin','vorgangid','eventkm']` and `df[num_cols] = (df[num_cols].apply(...)` will process __ALL__ columns (in the `num_cols` list) in one step – MaxU - stand with Ukraine Sep 14 '16 at 14:24
2

You can use the converters parameter of read_csv and define a custom function for splitting:

def f(x):
    return [float(i) for i in x.split(',')]

# after testing, replace io.StringIO(temp) with the real filename; temp and the imports are defined in the second snippet below
df = pd.read_csv(io.StringIO(temp), 
                 sep=";", 
                 converters={'D_8_lamsoni_w_time':f, 'D_8_lamsoni_w_value':f})
print (df)
       vin  vorgangid  eventkm D_8_lamsoni_w_time     D_8_lamsoni_w_value
0  V345578  295234545       13  [-1000.0, -980.0]  [7.9921875, 11.984375]
1  V346670  329781064       13   [-960.0, -940.0]  [7.9921875, 11.984375]
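
If, as discussed in the comments, there are far more than two list-like columns, the converters mapping can be built from the file header instead of being typed out by hand. A sketch, assuming the data sits in a real file (the path below is hypothetical) and that every column apart from the three ID columns holds comma-separated floats:

import pandas as pd

def f(x):
    return [float(i) for i in x.split(',')]

filename = 'data.csv'                             # hypothetical path; use the real file
id_cols = ['vin', 'vorgangid', 'eventkm']         # columns to leave untouched
header = pd.read_csv(filename, sep=';', nrows=0)  # read only the header row
conv = {c: f for c in header.columns if c not in id_cols}

df = pd.read_csv(filename, sep=';', converters=conv)

This keeps everything in a single read_csv call, which is closer to the "one import command with the right parameters" requested in the comments.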

Another solution, working with NaN in the 4th and 5th columns:

You can use read_csv with separator ;, then apply str.split to the 4th and 5th columns selected by iloc, and convert each value in the lists to float:

import pandas as pd
import numpy as np
import io

temp=u"""vin;vorgangid;eventkm;D_8_lamsoni_w_time;D_8_lamsoni_w_value
V345578;295234545;13;-1000.0,-980.0;7.9921875,11.984375
V346670;329781064;13;-960.0,-940.0;7.9921875,11.984375"""
# after testing, replace io.StringIO(temp) with the real filename
df = pd.read_csv(io.StringIO(temp), sep=";")

print (df)
       vin  vorgangid  eventkm D_8_lamsoni_w_time  D_8_lamsoni_w_value
0  V345578  295234545       13     -1000.0,-980.0  7.9921875,11.984375
1  V346670  329781064       13      -960.0,-940.0  7.9921875,11.984375

# split the 4th and 5th columns and convert each value to float
df.iloc[:,3] = df.iloc[:,3].str.split(',').apply(lambda x: [float(i) for i in x])
df.iloc[:,4] = df.iloc[:,4].str.split(',').apply(lambda x: [float(i) for i in x])
print (df)
       vin  vorgangid  eventkm D_8_lamsoni_w_time     D_8_lamsoni_w_value
0  V345578  295234545       13  [-1000.0, -980.0]  [7.9921875, 11.984375]
1  V346670  329781064       13   [-960.0, -940.0]  [7.9921875, 11.984375]

If you need numpy arrays instead of lists:

# split the 4th and 5th columns and convert to numpy arrays
df.iloc[:,3] = df.iloc[:,3].str.split(',').apply(lambda x: np.array([float(i) for i in x]))
df.iloc[:,4] = df.iloc[:,4].str.split(',').apply(lambda x: np.array([float(i) for i in x]))
print (df)
       vin  vorgangid  eventkm D_8_lamsoni_w_time     D_8_lamsoni_w_value
0  V345578  295234545       13  [-1000.0, -980.0]  [7.9921875, 11.984375]
1  V346670  329781064       13   [-960.0, -940.0]  [7.9921875, 11.984375]

print (type(df.iloc[0,3]))
<class 'numpy.ndarray'>

I tried to improve your solution:

a = 0
csv_import = pd.read_csv(folder + FileName, sep=';')
for col in csv_import.columns:
    a += 1
    if type(csv_import.loc[0, col]) == str and a > 3:
        # split the string and convert each piece to float in one pass
        csv_import[col] = csv_import[col].apply(lambda x: [float(y) for y in x.split(',')])
jezrael
  • Thank you for your input, but correct me if I am wrong, this looks more tedious than my approach. Cheers – valenzio Sep 14 '16 at 11:56
  • Ok, I tried to improve your solution, please check the last paragraph in my answer. – jezrael Sep 14 '16 at 12:28
  • Looking good, but I guess there is no way to avoid the for loop. I just thought there was a way to initialize the import so that it automatically knows that (',')-separated values are an array. – valenzio Sep 14 '16 at 12:33
  • No, I think you need my first solution with `converters` if you need to convert some columns to lists. – jezrael Sep 14 '16 at 12:36