
I'm working with a cryptocurrency data sample in which each cell contains a dictionary. Each dictionary holds the open price, close price, highest price, lowest price, volume and market cap. The columns are the corresponding dates and the index is the name of each cryptocurrency.

I don't know how to prepare the data so that I can find the correlation between different currencies, or between, for example, highest price and volume. How can this be done in Python (pandas)? Also, how would I define a date range in such a situation?

Here's a link to the data sample, my code and a printout of the data (access is open to the public): https://drive.google.com/open?id=1mjgq0lEf46OmF4zK8sboXylleNs0zx7I

1 Answer


To begin with, I would suggest rearranging your data so that each currency's OHLCV values get their own columns (e.g. "btc_open | btc_high" etc.). This makes generating correlation matrices far easier. I'd also suggest starting with only one metric (e.g. close price), and perhaps period movement (e.g. close minus open), in your analysis. To answer your question:

Pandas can return a correlation matrix of all columns with:

df.corr()

If you want to use only specific columns, select those from the DataFrame:

df[["col1", "col2"]].corr()

You can return a single correlation value between two columns with the form:

df["col1"].corr(df["col2"])
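To make the three forms above concrete, here is a minimal sketch on made-up close prices (the column names `btc_close`/`eth_close` and the values are hypothetical, just to illustrate the calls):

```python
import pandas as pd

# Hypothetical toy data: daily close prices for two coins
df = pd.DataFrame({
    "btc_close": [100.0, 102.0, 101.0, 105.0],
    "eth_close": [10.0, 10.3, 10.1, 10.6],
})

full_matrix = df.corr()                              # matrix over all columns
pair_matrix = df[["btc_close", "eth_close"]].corr()  # selected columns only
single = df["btc_close"].corr(df["eth_close"])       # one scalar value
print(single)
```

`full_matrix` and `pair_matrix` are DataFrames with the column names as both index and columns; `single` is a plain float between -1 and 1.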

If you'd like to specify a specific date range, I'd refer you to this question. I believe this will require your date column or index to be of the type datetime. If you don't know how to work with or convert to this type, I would suggest consulting the pandas documentation (perhaps begin with pandas.to_datetime).
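As a sketch of that date-range idea, assuming your dates have been converted to a `DatetimeIndex` (the index values and column name below are hypothetical):

```python
import pandas as pd

# Hypothetical frame with a DatetimeIndex built via pd.to_datetime
df = pd.DataFrame(
    {"btc_close": [100.0, 102.0, 101.0, 105.0, 103.0]},
    index=pd.to_datetime(
        ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"]
    ),
)

# Label-based slicing on a DatetimeIndex is inclusive of both endpoints
window = df.loc["2018-01-02":"2018-01-04"]
print(len(window))
```

You can then call `.corr()` on `window` alone to get correlations for just that period.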

In the future, I would suggest including a data snippet in your post. I don't believe Google Drive is an appropriate way to share data, and it is definitely not appropriate to set the data to "request access".

EDIT: I checked your data and created a smaller subset to test this method on. If there are imperfections in the data you may run into problems, but I had none when I tested it on a sample of your first 100 days and 10 coins (after transposing, df.iloc[:100, :10]).

Firstly, transpose the DataFrame so columns are organised by coin and rows are dates.

df = df.T

Following this, we concatenate into a new DataFrame (result). Alternatively, concatenate into the original and drop the old columns afterwards. Unfortunately I can't think of a non-iterative method. This method goes column by column, creates a DataFrame for each coin, adds the coin name as a prefix to the column names, then concatenates each DataFrame onto the end.

result = pd.DataFrame()
coins = df.columns.tolist()
for coin in coins:
    coin_data = df[coin]
    split_coin = coin_data.apply(pd.Series).add_prefix(coin+"_")
    result = pd.concat([result, split_coin], axis=1)
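Putting the whole pipeline together on a tiny made-up frame in the original layout (coin names as index, dates as columns, a dict in each cell; the coin names, dates and two-field dicts here are hypothetical stand-ins for the full OHLCV dicts):

```python
import pandas as pd

# Hypothetical miniature of the original layout
raw = pd.DataFrame(
    {
        "2018-01-01": [{"open": 100, "close": 102}, {"open": 10, "close": 11}],
        "2018-01-02": [{"open": 102, "close": 101}, {"open": 11, "close": 10}],
    },
    index=["btc", "eth"],
)

df = raw.T  # transpose: rows become dates, columns become coins

result = pd.DataFrame()
for coin in df.columns:
    # expand each column of dicts into its own columns, prefixed by coin name
    split_coin = df[coin].apply(pd.Series).add_prefix(coin + "_")
    result = pd.concat([result, split_coin], axis=1)

print(sorted(result.columns))
# ['btc_close', 'btc_open', 'eth_close', 'eth_open']
```

After this, `result.corr()` gives the correlation matrix across every coin/metric column.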
NaT3z
  • Thanks for the answer. How exactly would I rearrange the data to make the OHLCV of each currency their own columns, that's exactly what I want to do, but can't. Currently each cell contains a dictionary (OHLCV). – Masange Matiwane Jan 24 '19 at 13:47
  • Please provide a sample of the data from top down (so I can see the structure of the dictionary) and I'll try to give you some guidance on how to put it into a DataFrame. – NaT3z Jan 25 '19 at 08:43
  • You can access the data printout, data sample and my code on here (access is open): https://drive.google.com/open?id=1mjgq0lEf46OmF4zK8sboXylleNs0zx7I – Masange Matiwane Jan 25 '19 at 09:36
  • Check my edit. Your dataset is quite large, hopefully the method works for you. I'm certain the process will work at least on a subset of your data. For any regression analysis you can always use my above method but simply create a list of the columns you want (coin names) rather than the entire DataFrame. Unfortunately you won't be able to perform meaningful multiple regression analysis on this set given the enormous number of predictors anyway so I would certainly suggest simplifying both which coins you analyse, and which pricing metric you use (OHLC). – NaT3z Jan 25 '19 at 11:24
  • Thank you very much NaT3z!... I've done as suggested, but I get an AtributeError when I run the last code: result.concat([result, split_coin], axis=1)... AttributeError: 'DataFrame' object has no attribute 'concat' – Masange Matiwane Jan 25 '19 at 13:28
  • Sorry Masange, that's an error on my behalf when retyping my code. It should be result = pd.concat([result, split_coin], axis=1). I've edited my answer for you. – NaT3z Jan 25 '19 at 23:04
  • Thanks again... when I run the correlation code: df["col1"].corr(df["col2"]) I get the TypeError: unsupported operand type(s) for +: 'dict' and 'dict' ... and also the code: df[["col1", "col2"]].corr() returns an empty output (just an underscore). – Masange Matiwane Jan 28 '19 at 12:37
  • You'd best check for imperfections in the columns you're trying to find the correlation between. One may have the incorrect type, or df.corr() may not play well with NaN values. I did my code above on a dataset without NaN (df.dropna(how="any", axis=1)). Then I tried result_corr = result.corr(). This returned a 1206x1206 correlation DataFrame with no issues. This tells me that a column with NaN values is your problem. I don't know if the NaN values specifically are causing the TypeError. To sort for only numeric, you can use df.select_dtypes(include=["number"]). – NaT3z Jan 29 '19 at 12:01
  • You're right, the issue seems to be the NaN values. Thanks alot! :) – Masange Matiwane Jan 30 '19 at 15:59
  • Hey NaT3z! I want to to create time series correlation matrices for a period of 1 month, and then step forward say 2 days at a time. For the top 20 coins (by market cap). Any suggestions? – Masange Matiwane Jan 31 '19 at 15:43