0

I have a large .csv file which is constantly being updated in real time with several thousand lines displayed as follows:

 time1,stockA,bid,1
 time2,stockA,ask,1.1
 time3,stockB,ask,2.1
 time4,stockB,bid,2.0
 time5,stockA,bid,1.1
 time6,stockA,ask,1.2

What is the fastest way to read this into a dataframe that looks like this:

   time     stock       bid    ask
   time1    stockA      1      
   time2    stockA             1.1
   time3    stockB             2.1
   time4    stockB      2.0    
   time5    stockA      1.1
   time6    stockA             1.2

Any help is appreciated

qts
  • 984
  • 2
  • 14
  • 25
  • Show us what you've tried and where you're stuck – Soma May 13 '15 at 08:42
  • I've been thinking about this all day and it has me truly stumped :-(... The only way I can think of is to loop through each line. but that would take too long. – qts May 13 '15 at 08:55
  • Your update doesn't correctly show what happens with time5 and 6 rows, it's still unclear what the 1 and 2 really men with respect to bid columns – EdChum May 13 '15 at 09:31
  • okay updated again. hopefully it's clearer now. Each row is essentially providing a time, stock and price and telling you what type of price it is (e.g. bid, ask, last, high etc.). I'm now experimenting with a solution involving separating out the bids and asks and then merging them back. – qts May 13 '15 at 09:38

2 Answers2

1

You can use read_csv and specify header=None and pass the column names as a list:

In [124]:

t="""time1,stockA,bid,1
 time2,stockA,ask,1.1
 time3,stockB,ask,2.1
 time4,stockB,bid,2.0"""
​
df = pd.read_csv(io.StringIO(t), header=None, names=['time', 'stock', 'bid', 'ask'])
df
Out[124]:
     time   stock  bid  ask
0   time1  stockA  bid  1.0
1   time2  stockA  ask  1.1
2   time3  stockB  ask  2.1
3   time4  stockB  bid  2.0

You'll have to re-encode the bid column to 1 or 2:

In [126]:

df['bid'] = df['bid'].replace('bid', 1)
df['bid'] = df['bid'].replace('ask', 2)
df
Out[126]:
     time   stock  bid  ask
0   time1  stockA    1  1.0
1   time2  stockA    2  1.1
2   time3  stockB    2  2.1
3   time4  stockB    1  2.0

EDIT

Based on your updated sample data and desired output the following works:

In [29]:

t="""time1,stockA,bid,1
 time2,stockA,ask,1.1
 time3,stockB,ask,2.1
 time4,stockB,bid,2.0
 time5,stockA,bid,1.1
 time6,stockA,ask,1.2"""
​
df = pd.read_csv(io.StringIO(t), header=None, names=['time', 'stock', 'bid', 'ask'])
df
Out[29]:
     time   stock  bid  ask
0   time1  stockA  bid  1.0
1   time2  stockA  ask  1.1
2   time3  stockB  ask  2.1
3   time4  stockB  bid  2.0
4   time5  stockA  bid  1.1
5   time6  stockA  ask  1.2
In [30]:

df.loc[df['bid'] == 'bid', 'bid'] = df['ask']
df.loc[df['bid'] != 'ask', 'ask'] = ''
df.loc[df['bid'] == 'ask','bid'] = ''
df
Out[30]:
     time   stock  bid  ask
0   time1  stockA    1     
1   time2  stockA       1.1
2   time3  stockB       2.1
3   time4  stockB    2     
4   time5  stockA  1.1     
5   time6  stockA       1.2
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • In my case the 'bid' column has different values for each stock, so I will not know what value to replace it with. – qts May 13 '15 at 08:52
  • Well **you** have to decide how to encode it and what they mean, it's a trivial operation if you need to do it – EdChum May 13 '15 at 09:09
  • I've edited my example a bit to show how this might not work. Or am I just misunderstanding..? – qts May 13 '15 at 09:28
  • I've added an extra dimension to the problem to see how your code can be amended to address it. – qts May 13 '15 at 10:51
  • @julianlim sorry but if you have another problem post another question, incrementally increasing the scope of the question on SO is not good practice – EdChum May 13 '15 at 10:51
  • @julianlim I can answer your question but you should post another question, basically it will also handle this case better actually – EdChum May 13 '15 at 10:55
  • oh sorry. I'm quite new to his and thought it was advisable in order to keep the related topic together. I will post as a separate question. – qts May 13 '15 at 11:03
1

Here is a more concise way I think.

 df = pd.read_csv('prices.csv', header=None, names=['time', 'stock', 'type',   'prices'], 
                  index_col=['time', 'stock', 'type'])

In [1062]:

df
Out[1062]:
                    prices
time    stock   type    
time1   stockA  bid 1.0
time2   stockA  ask 1.1
time3   stockB  ask 2.1
time4   stockB  bid 2.0
time5   stockA  bid 1.1
time6   stockA  ask 1.2
time7   stockA  high1.5
time8   stockA  low 0.5

I think that's what the DataFrame should look like. Then do

In [1064]:

df.unstack()
Out[1064]:
                prices
type            ask bid high low
time    stock               
time1   stockA  NaN 1.0 NaN NaN
time2   stockA  1.1 NaN NaN NaN
time3   stockB  2.1 NaN NaN NaN
time4   stockB  NaN 2.0 NaN NaN
time5   stockA  NaN 1.1 NaN NaN
time6   stockA  1.2 NaN NaN NaN
time7   stockA  NaN NaN 1.5 NaN
time8   stockA  NaN NaN NaN 0.5

You can fill the Nans with whatever you prefer using df.fillna. Generally speaking, converting a columns values into column headers is called pivoting. .unstack pivots a level of a MultiIndex. You can check .pivot as well.

JoeCondron
  • 8,546
  • 3
  • 27
  • 28