
I have two questions concerning a large csv file that contains data formatted as strings in the following way:

 "XAU=,XAU=,XAG=,XAG="  
 "25/08/2014 6:00:05,1200.343,25/08/2014 6:00:03,19.44,"            
 "25/08/2014 6:00:05,1200,,,"

Is there a way to efficiently load this into a pandas DataFrame object? Alternatively, multiple pandas Series objects would also do the job. So far I have tried:

df = pd.read_csv(path, header=None)
df[0].str[0:-1].str.split(',', expand=True)

The second line is from the answers to the thread pandas split string into columns. However, I wonder if there is an even better way, especially since I have different data types. Secondly, how can I correctly parse the dates with to_datetime()? I tried to reindex df and used df.index = df.index.to_datetime(), but this only half worked: it did not strictly keep the dd/mm/yyyy ... format, and some dates were incorrectly parsed as mm/dd/yyyy .... I'm looking for fast approaches because eventually I will loop over many such csv files. Thanks for any help!

EDIT: Ideally data in this form should be handled as well:

     "XAU=,XAU=,XAG=,XAG="  
     "25/08/2014 6:00:05,1200.343,25/08/2014 6:00:03,19.44,"            
     ",,25/08/2014 6:00:05,19.50,"

So with the answer provided below,

    from io import StringIO

    data = StringIO(
    '''
    "XAU=,XAU=,XAG=,XAG="
    "25/08/2014 6:00:05,1200.343,25/08/2014 6:00:03,19.44,"
    ",,25/08/2014 6:00:05,19.5,"
    ''')

df would become:

                  XAU                    XAU                       XAG      XAG
0  25/08/2014 6:00:05               1200.343        25/08/2014 6:00:03    19.44
1                   ,     25/08/2014 6:00:05                      19.5       \n
Tim

1 Answer


You can do all of the preprocessing inside read_csv, as shown:

import csv
from io import StringIO

import pandas as pd

data = StringIO(
'''
"XAU=,XAU=,XAG=,XAG="  
"25/08/2014 6:00:05,1200.343,25/08/2014 6:00:03,19.44,"            
"25/08/2014 6:00:05,1200,,,"
''')

df = pd.read_csv(data, quoting=csv.QUOTE_NONE, index_col=False, escapechar='"',
                 parse_dates=[0, 2]).rename(columns=lambda x: x.split("=")[0])
df

[image: the resulting DataFrame]

df.dtypes

XAU    datetime64[ns]
XAU           float64
XAG    datetime64[ns]
XAG           float64
dtype: object

Breakdown:

quoting=csv.QUOTE_NONE: perform no special processing of quote characters

index_col=False: do not use the first column as the index

escapechar='"': character used to escape the delimiter; here it also consumes the stray double quotes

parse_dates=[0, 2]: parse columns 0 and 2 (by position) as datetime objects
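On the strict dd/mm/yyyy parsing asked about in the question: pandas' default inference can flip ambiguous dates, so it may be worth passing dayfirst=True or, stricter still, an explicit format to to_datetime. A small sketch with illustrative values:

```python
import pandas as pd

# "05/08/2014" is ambiguous: 5 August or 8 May?
s = pd.Series(["05/08/2014 6:00:05", "25/08/2014 6:00:03"])

# option 1: hint the parser that the day comes first
hinted = pd.to_datetime(s, dayfirst=True)

# option 2: an explicit format removes the guesswork entirely
strict = pd.to_datetime(s, format="%d/%m/%Y %H:%M:%S")
```

read_csv accepts dayfirst=True as well, so it can be combined directly with the parse_dates call above.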


To read a subset of the columns, supply the appropriate integer indices to usecols, as shown:

# note: re-create `data` first; a StringIO buffer is exhausted after the first read
df = pd.read_csv(data, quoting=csv.QUOTE_NONE, index_col=False, escapechar='"',
                 parse_dates=[0], usecols=[0, 1]).rename(columns=lambda x: x.split("=")[0])

df

[image: the resulting DataFrame]

If you want to convert the two XAU columns into a Series object:

df.columns = ['%s_%d' % (col, i) for i, col in enumerate(df.columns)]

ser = pd.Series(data=df['XAU_1'].values, index=df['XAU_0'].values, name='XAU')
ser

2014-08-25 06:00:05    1200.343
2014-08-25 06:00:05    1200.000
Name: XAU, dtype: float64

type(ser)
pandas.core.series.Series

The reason this fails for the newer case is that escapechar consumes the character that follows it: the leading double quote escapes the first comma, so that delimiter is treated as literal text and the empty fields are not captured properly.
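A minimal illustration of that pitfall with toy data (not the original file): the escaped comma survives as a literal value and every field shifts left by one.

```python
import csv
from io import StringIO

import pandas as pd

# the leading double quote escapes the first comma, so that delimiter
# becomes literal text and the row loses a field
bad = pd.read_csv(StringIO('x,y,z\n",,25,'), quoting=csv.QUOTE_NONE,
                  escapechar='"')
# column x holds a literal "," and 25 lands in y instead of z
```

This reproduces the stray "," that the question's edit shows in the first column.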

If that's the case, you are better off dropping escapechar altogether and proceeding as shown:

For the combination of old and new data:

data = StringIO(
'''
"XAU=,XAU=,XAG=,XAG="   
"25/08/2014 6:00:05,1200.343,25/08/2014 6:00:03,19.44," 
"25/08/2014 6:00:05,1200,,,"
",,25/08/2014 6:00:05,19.50,"       
''')


df = pd.read_csv(data, quoting=csv.QUOTE_NONE, index_col=False, na_values=[""], 
                 parse_dates=[2]).rename(columns=lambda x: x.strip('"').split("=")[0])

old_cols = df.columns
# Index(['XAU', 'XAU', 'XAG', 'XAG'], dtype='object')

new_cols = list(range(len(df.columns)))
# [0, 1, 2, 3]
df.columns = new_cols

# Converting first column to datetime dtype
df[0] = pd.to_datetime(df[0].str.replace('"', ''))   
df.columns = old_cols

df

[image: the resulting DataFrame]

df.dtypes

XAU    datetime64[ns]
XAU           float64
XAG    datetime64[ns]
XAG           float64
dtype: object
Nickil Maveli
  • Just curious, would it be possible to only read, let's say, the two "XAU" columns in a dataframe or directly in a pandas Series object? – Tim Sep 14 '16 at 20:22
  • Yes, it is possible to select a limited number of columns to be read into the dataframe(*See edited response*). – Nickil Maveli Sep 15 '16 at 07:18
  • Unfortunately I encountered a subtle problem. When the data arrives in the following form (slightly diff.): ''' "XAU=,XAU=,XAG=,XAG=" "25/08/2014 6:00:05,1200.343,25/08/2014 6:00:03,19.44," ",,25/08/2014 6:00:05,19.50," ''' the df will contain a comma as a value for the second row and first column. As a result, the data won't be aligned properly. – Tim Sep 19 '16 at 16:57
  • Consequently, the date parsing fails as well. – Tim Sep 19 '16 at 16:58
  • Ps. I edited my question at the end. So it's just a slight change of the incoming data string. – Tim Sep 19 '16 at 17:05
  • The code has been modified to suit such purposes. *See edit* – Nickil Maveli Sep 19 '16 at 20:23
  • That's great, thank you a lot. It works fine now. Only as a last remark, I wonder if it's also possible to specify the dtype of the price quotes by column index, so that pandas doesn't need to infer it? Generally speaking, is it wise to read data from such a "string like" format (this is how it's provided to me)? – Tim Sep 20 '16 at 14:51
  • Yes, you could specify the `dtype` during the reading operation itself by passing a `dict` of column names/column positions as keys and the types as values of the key-value pair. Like, `dtype={1: np.float64, 3: np.float64}`, where 1 and 3 correspond to the price columns. – Nickil Maveli Sep 20 '16 at 15:23
  • It's better if you could strip off those double quotes, as they do not give any additional information and only lead to confusion, as we just saw. Read about `csv.reader`, which you could use to make such modifications prior to using pandas' `read_csv`. – Nickil Maveli Sep 20 '16 at 15:27
  • Indeed, I can read and collect each row as a "full string" in a list and perform the following: `data = StringIO("\n".join(ListWithRowsAsStrings))` – Tim Sep 20 '16 at 17:25
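Putting the last few comments together, a rough sketch of stripping the stray quotes before pandas sees the data, while pinning the price dtypes (the dtype-by-position keys follow the suggestion in the comments and are worth verifying on your pandas version):

```python
from io import StringIO

import numpy as np
import pandas as pd

# rows as they arrive, each wrapped in stray double quotes
raw = ['"XAU=,XAU=,XAG=,XAG="',
       '"25/08/2014 6:00:05,1200.343,25/08/2014 6:00:03,19.44,"',
       '",,25/08/2014 6:00:05,19.50,"']

# drop the surrounding quotes and the trailing comma, then rebuild a clean csv
cleaned = "\n".join(line.strip('"').rstrip(',') for line in raw)

df = pd.read_csv(StringIO(cleaned), parse_dates=[0, 2], dayfirst=True,
                 dtype={1: np.float64, 3: np.float64})
df = df.rename(columns=lambda x: x.split("=")[0])
```

With the quotes gone, no escapechar trickery is needed, and the empty leading fields in the last row come through as NaT/NaN instead of shifting the columns.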