1

I have just started learning machine learning and scikit-learn. I have been watching a tutorial in which the instructor uses Quandl to fetch Google stock prices. As far as I can tell, quandl.get returns a pandas DataFrame. What confuses me is that one line of code seems to add columns in the second dimension of the DataFrame, while another line accesses one of those same columns using the FIRST dimension. How is that possible? What is going on with this DataFrame?

df = quandl.get('WIKI/GOOGL')

df = df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume']]

df['HCL_PCT'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] # how is df['Adj. Open'] working?? Wasn't 'Adj. Open' added in the second dimension of the dataframe in the second line of the code above??

My goal is to learn TensorFlow, and I want some knowledge of machine learning terminology and concepts before I dive into it.

user2498079

2 Answers

0

I added print(df.head()) after each step to show what the data looks like:

import quandl

#read data
df = quandl.get('WIKI/GOOGL')
print(df.head())
              Open    High     Low    Close      Volume  Ex-Dividend  \
Date                                                                   
2004-08-19  100.01  104.06   95.96  100.335  44659000.0          0.0   
2004-08-20  101.01  109.08  100.50  108.310  22834300.0          0.0   
2004-08-23  110.76  113.48  109.05  109.400  18256100.0          0.0   
2004-08-24  111.24  111.60  103.57  104.870  15247300.0          0.0   
2004-08-25  104.76  108.00  103.88  106.000   9188600.0          0.0   

            Split Ratio  Adj. Open  Adj. High   Adj. Low  Adj. Close  \
Date                                                                   
2004-08-19          1.0  50.159839  52.191109  48.128568   50.322842   
2004-08-20          1.0  50.661387  54.708881  50.405597   54.322689   
2004-08-23          1.0  55.551482  56.915693  54.693835   54.869377   
2004-08-24          1.0  55.792225  55.972783  51.945350   52.597363   
2004-08-25          1.0  52.542193  54.167209  52.100830   53.164113   

            Adj. Volume  
Date                     
2004-08-19   44659000.0  
2004-08-20   22834300.0  
2004-08-23   18256100.0  
2004-08-24   15247300.0  
2004-08-25    9188600.0  

#select data by columns (filter) and set order of columns 
df = df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume']]
print(df.head())
            Adj. Open  Adj. High   Adj. Low  Adj. Close  Adj. Volume
Date                                                                
2004-08-19  50.159839  52.191109  48.128568   50.322842   44659000.0
2004-08-20  50.661387  54.708881  50.405597   54.322689   22834300.0
2004-08-23  55.551482  56.915693  54.693835   54.869377   18256100.0
2004-08-24  55.792225  55.972783  51.945350   52.597363   15247300.0
2004-08-25  52.542193  54.167209  52.100830   53.164113    9188600.0

#compute a new column - each df['...'] selects one existing column
df['HCL_PCT'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']
print(df.head())
            Adj. Open  Adj. High   Adj. Low  Adj. Close  Adj. Volume   HCL_PCT
Date                                                                          
2004-08-19  50.159839  52.191109  48.128568   50.322842   44659000.0  0.003250
2004-08-20  50.661387  54.708881  50.405597   54.322689   22834300.0  0.072270
2004-08-23  55.551482  56.915693  54.693835   54.869377   18256100.0 -0.012279
2004-08-24  55.792225  55.972783  51.945350   52.597363   15247300.0 -0.057264
2004-08-25  52.542193  54.167209  52.100830   53.164113    9188600.0  0.011837
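The arithmetic on that line is applied row by row. As a rough check, the sketch below rebuilds only the first two rows shown above (the values are copied from the output; the small frame itself is just for illustration) and recomputes `HCL_PCT`:

```python
import pandas as pd

# First two rows copied from the output above; a tiny stand-in for the Quandl data
df = pd.DataFrame(
    {'Adj. Open': [50.159839, 50.661387],
     'Adj. Close': [50.322842, 54.322689]},
    index=pd.to_datetime(['2004-08-19', '2004-08-20']),
)

# Each df['...'] returns one column (a Series); the operators work element-wise
df['HCL_PCT'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']

# The same calculation done by hand for the first row
first_row = (50.322842 - 50.159839) / 50.159839

print(df['HCL_PCT'].round(6).tolist())
print(round(first_row, 6))
```

The rounded values should match the `HCL_PCT` column printed above for those two dates.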

Selecting the single column Adj. Close returns a Series:

print(df['Adj. Close'])
Date
2004-08-19     50.322842
2004-08-20     54.322689
2004-08-23     54.869377
2004-08-24     52.597363
2004-08-25     53.164113
2004-08-26     54.122070
2004-08-27     53.239345
2004-08-30     51.162935
2004-08-31     51.343492
2004-09-01     50.280210
2004-09-02     50.912161
2004-09-03     50.159839
2004-09-07     50.947269
2004-09-08     51.308384
2004-09-09     51.313400
2004-09-10     52.828075
2004-09-13     53.916435
2004-09-14     55.917612
2004-09-15     56.173402
2004-09-16     57.161452
2004-09-17     58.926902
2004-09-20     59.864797
2004-09-21     59.102444
2004-09-22     59.373280
2004-09-23     60.597057
2004-09-24     60.100525
2004-09-27     59.313094
2004-09-28     63.626409
2004-09-29     65.742942
2004-09-30     65.000651
                 ...
2017-04-13    840.180000
2017-04-17    855.130000
2017-04-18    853.990000
2017-04-19    856.510000
2017-04-20    860.080000
2017-04-21    858.950000
2017-04-24    878.930000
2017-04-25    888.840000
2017-04-26    889.140000
2017-04-27    891.440000
2017-04-28    924.520000
2017-05-01    932.820000
2017-05-02    937.090000
2017-05-03    948.450000
2017-05-04    954.720000
2017-05-05    950.280000
2017-05-08    958.690000
2017-05-09    956.710000
2017-05-10    954.840000
2017-05-11    955.890000
2017-05-12    955.140000
2017-05-15    959.220000
2017-05-16    964.610000
2017-05-17    942.170000
2017-05-18    950.500000
2017-05-19    954.650000
2017-05-22    964.070000
2017-05-23    970.550000
2017-05-24    977.610000
2017-05-25    991.860000
Name: Adj. Close, Length: 3215, dtype: float64
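To make the difference between the two bracket forms concrete, here is a minimal sketch on a made-up two-row frame (the name `toy` and its values are assumptions for illustration, not the Quandl data). A string key returns a `Series`; a list of strings returns a `DataFrame`, even when the list holds a single name:

```python
import pandas as pd

# Toy data standing in for the stock prices (values are made up)
toy = pd.DataFrame({'Adj. Open': [50.0, 51.0], 'Adj. Close': [50.5, 54.0]})

one_column = toy['Adj. Close']     # string key -> one column
sub_frame = toy[['Adj. Close']]    # list key -> DataFrame, even with one name

print(type(one_column).__name__)
print(type(sub_frame).__name__)
print(one_column.shape, sub_frame.shape)
```

In both cases the outer brackets are ordinary indexing; the inner brackets are just a Python list literal passed as the key.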

EDIT:

import pandas as pd

df = pd.DataFrame({'A':[1,2,3],
                   'D':[4,5,6],
                   'B':[7,8,9],
                   'F':[1,3,5],
                   'C':[5,3,6]})

print(df)
   A  B  C  D  F
0  1  7  5  4  1
1  2  8  3  5  3
2  3  9  6  6  5

#select only columns A,B,C and return new dataframe in new order of columns
df1 = df[['A','B','C']]
print(df1)
   A  B  C
0  1  7  5
1  2  8  3
2  3  9  6

#select only columns C,A,B and return new dataframe with the columns in that order
df2 = df[['C','A','B']]
print(df2)
   C  A  B
0  5  1  7
1  3  2  8
2  6  3  9
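For readers coming from Java, JavaScript, or C, it may help to see how `df[[...]]` can be handled in plain Python. The sketch below is a toy, not pandas's actual implementation: `TinyFrame` is a hypothetical class invented here. It shows that `obj[key]` always calls `obj.__getitem__(key)` once, and that `[['C', 'A']]` passes a single list object as the key rather than indexing a second dimension:

```python
class TinyFrame:
    """Toy stand-in for a DataFrame: a dict mapping column name -> list of values."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def __getitem__(self, key):
        # tf[['C', 'A']] calls __getitem__ with ONE argument: the list ['C', 'A']
        if isinstance(key, list):
            return TinyFrame({name: self.columns[name] for name in key})
        # tf['A'] calls __getitem__ with the string 'A'
        return self.columns[key]


tf = TinyFrame({'A': [1, 2, 3], 'B': [7, 8, 9], 'C': [5, 3, 6]})

subset = tf[['C', 'A']]        # new TinyFrame with only C and A, in that order
print(list(subset.columns))
print(subset['A'])             # single-name access still works on the subset
print(tf['A'])
```

The real library does far more (index alignment, boolean masks, slices), but the dispatch on the type of the key is the part that explains the notation.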
jezrael
  • Shouldn't df['Adj. Open'] return garbage value or throw error altogether because it was declared in the second dimension of the Pandas DataFrame list on the second line of the code? – user2498079 May 26 '17 at 10:25
  • No, it only selects data and returns a `Series` (one column). You can select the same column as many times as you like. – jezrael May 26 '17 at 10:26
  • The second line only selects the listed columns and assigns the result back to `df`. – jezrael May 26 '17 at 10:27
  • And the third line selects single columns separately for the subtraction and the division. – jezrael May 26 '17 at 10:28
  • I don't understand. `df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume']]` is declaring `Adj. Open` in the `2nd` dimension of the dataframe. So how can we access `Adj. Open` like `df[Adj. Open]`. It makes no sense – user2498079 May 26 '17 at 10:28
  • No, that is not right. It only selects and reorders columns; it does not define a second dimension. – jezrael May 26 '17 at 10:29
  • It is same as [this answer](https://stackoverflow.com/a/23741480/2901002) – jezrael May 26 '17 at 10:30
  • Coming from Java, Javascript, C I have learned that this way a 2nd dimension list/array is defined. Can you please point me to a simple tutorial that explains how this `df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume']]` notation is working – user2498079 May 26 '17 at 10:30
  • Maybe [this](http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics) will help, especially the part from `In [9]: df` to `Attribute Access`. – jezrael May 26 '17 at 10:32
  • Can you please point me to a tutorial that shows how df[ [] ] is handled behind the scenes for reordering in Simple PYTHON without Pandas? – user2498079 May 26 '17 at 10:36
  • Sorry, but there is no plain-Python equivalent of this reordering, because `df[[ ]]` calls a pandas function. It is called subsetting: it simply selects only the specified columns and returns a new DataFrame with the columns in the given order. – jezrael May 26 '17 at 11:05
  • I will try to add another sample to the answer; maybe it will help you understand it better. – jezrael May 26 '17 at 11:15
0

index : Index or array-like

In a DataFrame, indexing with a single label returns one column, and indexing with an array (list) of labels returns several columns. This is the equivalent of df[:, [...]] in array notation: all rows are selected, and the list picks the columns.
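Note that `df[:, [...]]` is only a way of describing the idea; it is not valid pandas syntax. The closest real spelling is `.loc`, which takes a row selector and a column selector. A minimal sketch, using a small made-up frame, comparing the two forms:

```python
import pandas as pd

# Small made-up frame for illustration
df = pd.DataFrame({'A': [1, 2, 3], 'B': [7, 8, 9], 'C': [5, 3, 6]})

by_brackets = df[['C', 'A']]        # list of column labels
by_loc = df.loc[:, ['C', 'A']]      # all rows, same columns, spelled out

same_frame = by_brackets.equals(by_loc)
same_column = df['A'].equals(df.loc[:, 'A'])

print(same_frame, same_column)
```

Both comparisons should report that the results are equal, which is why the shorter bracket form is usually used for plain column selection.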

chenhong