0

I have a dataset in .xlsx with hundreds of thousands of rows as follow:

slug    symbol  name    date    ranknow open    high    low close   volume  market  close_ratio spread
companyA    AAA companyA    28/04/2013  1   135,3   135,98  132,1   134,21  0   1500520000  0,5438  3,88
companyA    AAA companyA    29/04/2013  1   134,44  147,49  134 144,54  0   1491160000  0,7813  13,49
companyA    AAA companyA    30/04/2013  1   144 146,93  134,05  139 0   1597780000  0,3843  12,88
....
companyA    AAA companyA    17/04/2018  1   8071,66 8285,96 7881,72 7902,09 6900880000  1,3707E+11  0,0504  404,24
....
lancer  LA  Lancer  09/01/2018  731 0,347111    0,422736    0,345451    0,422736    3536710 0   1   0,08
lancer  LA  Lancer  10/01/2018  731 0,435794    0,512958    0,331123    0,487106    2586980 0   0,8578  0,18
lancer  LA  Lancer  11/01/2018  731 0,479738    0,499482    0,309485    0,331977    950410  0   0,1184  0,19
....
lancer  LA  Lancer  17/04/2018  731 0,027279    0,041106    0,02558 0,031017    9936    1927680 0,3502  0,02
....
yocomin YC  Yocomin 21/01/2016  732 0,008135    0,010833    0,002853    0,002876    63  139008  0,0029  0,01
yocomin YC  Yocomin 22/01/2016  732 0,002872    0,008174    0,001192    0,005737    69  49086   0,651   0,01
yocomin YC  Yocomin 23/01/2016  732 0,005737    0,005918    0,001357    0,00136     67  98050   0,0007  0
....
yocomin YC  Yocomin 17/04/2018  732 0,020425    0,021194    0,017635    0,01764 12862   2291610 0,0014  0
....

Let's say I have a .txt file with a list of symbol of that time series I want to extract. For example:

AAA
LA
YC

I would like to get a dataset that would look as follow:

date          AAA        LA        YC
28/04/2013    134,21     NaN       NaN
29/04/2013    144,54     NaN       NaN
30/04/2013    139        NaN       NaN
....
....
....
17/04/2018    7902,09    0,031017  0,01764

where under the stock name (like AAA, etc) i get the "close" price. I'm open to both Python and R. Any help would be grate!

toyo10
  • 121
  • 1
  • 14

1 Answers1

1

In python using pandas, this should work.

import pandas as pd

df = pd.read_excel("/path/to/file/Book1.xlsx")
df = df.loc[:, ['symbol', 'name', 'date', 'close']]
df = df.set_index(['symbol', 'name', 'date'])
df = df.unstack(level=[0,1])
df = df['close']

to read the symbols file file and then filter out symbols not in the dataframe:

symbols = pd.read_csv('/path/to/file/symbols.txt', sep=" ", header=None)
symbols = symbols[0].tolist()
symbols = pd.Index(symbols).unique()
symbols = symbols.intersection(df.columns.get_level_values(0))

And the output will look like:

print(df[symbols])


symbol                   AAA        LA        YC
name                companyA    Lancer   Yocomin
date                                            
2018-09-01 00:00:00     None  0,422736      None
2018-10-01 00:00:00     None  0,487106      None
2018-11-01 00:00:00     None  0,331977      None
Troy D
  • 2,093
  • 1
  • 14
  • 28
  • Thank you very much for your answer. However, an error appears: "ValueError: Index contains duplicate entries, cannot reshape". Maybe because my .txt file can contain some symbols actually not present in the symbol column. – toyo10 Apr 24 '18 at 17:06
  • The problem is that in the dataset (df in your example) we don't have unique "symbols", therefore `df.unstack()` fails. Sometimes we have 2 different AAA series for examples. Maybe can be useful to attach a 2,3,etc. in case we have duplicates to avoid the `ValueError`. Any idea? – toyo10 Apr 25 '18 at 10:14
  • So you're saying there may be multiple entries for the same date and symbol pair? What if we index on the symbol and name? Does that create a unique column identifier? Check the edit to the answer to see if that fixes it. – Troy D Apr 25 '18 at 21:20
  • Thank you very much it works!! The only things is that I get a `KeyError` in `print(df[symbols])` because some of my data in the `.txt` file are not present in `df['close']`. – toyo10 Apr 26 '18 at 11:36
  • Updated the answer to filter out symbols not in the dataframe. Let me know if that fixes it. – Troy D Apr 26 '18 at 16:18
  • Thank you @Troy D. Unfortunately if I do it `print(df[symbols])` gives me the following error: `NotImplementedError: Index._join_level on non-unique index is not implemented`. – toyo10 Apr 26 '18 at 16:55
  • Ok, sounds like your txt file has duplicate symbols. Add .unique() to filter out duplicates. – Troy D Apr 26 '18 at 18:51
  • Thank you so much! – toyo10 Apr 27 '18 at 15:19