0

I have a function that parses json and then loads it into a dataframe. My general approach was to loop through each file and then concat it to the existing DF that was adding all the entries over time:

HoldingsDF = loadFundETFData(ticker1, fundsDict[ticker1])
holdingsFullDF = pd.concat([holdingsFullDF, HoldingsDF])

The problem is this loads 30k files and it taking over 8 hours. So I tried this:

HoldingsDF = loadFundETFData(ticker1, fundsDict[ticker1])
holdingsFullList.append(HoldingsDF.to_dict())

then once the loop is done I tried to combine it via:

holdingsFullDF = pd.DataFrame.from_records(holdingsFullList)

(I'm tried just loading the df and doing it from_dict as well). I'm getting this output:

Column1
{0: 'IBM', 1: 'google'..

I expect:

Column1
IBM
Google
..

My concat per loop worked(but took 8 hours), this new approach is done within a min but is not loading it correctly(same problem on all columns). What am I doing wrong when loading the data?

Lostsoul
  • 25,013
  • 48
  • 144
  • 239
  • Depending on how exactly are you looping and concatenating. It is best to append your dataframes to a list and perform the concatenate after looping. See this post (https://stackoverflow.com/a/36489724/6361531) – Scott Boston Dec 16 '20 at 14:05
  • Does your json look similar to the one in this question?: https://stackoverflow.com/q/65233232/6386471 – user6386471 Dec 16 '20 at 14:05
  • @ScottBoston I tried to append the DF directly into a list but when I load it into the DF with all the data, it now just shows up as one column with all the data. – Lostsoul Dec 16 '20 at 14:18
  • Do you need to `pd.concat` with `axis = 1`? Oh, you could append the dataframes and not the dictionary to your list. – Scott Boston Dec 16 '20 at 14:57

1 Answers1

1

Your issue is that you have unecessary steps when getting your data. I am assuming that you intend to get the reference (or index) of each element from your Json along with the associated data.

It seems you end up with data that has this shape :

holdingsFullList = [{"Column1": {0: "IBM"}}, {"Column1": {1:"Google"}}]
pd.Dataframe.from_records(holdingFullList) # Your result

What you should do is avoid some steps by gathering your data in the simplest way, like with a list comprehension :

ticker_list = [...] # All your tickers references
holdindfFullList = [ticker, fundsDict[ticker] for ticker in ticker_list]

You'll get a list of tuples, like this one :

[(0, "IBM"), (1, "Google"), (2, "Apple")])

Now, you can simply give it to pandas:

pd.DataFrame(holdindsFullList, columns=["Reference", "Column1"])) # Expected result.

If fundsDict actually yields lists or tuples as values, instead of single values, you should add a * in your list comprehension then make sure you specify the correct number of column names to pandas:

holdindsFullList = [ticker, *fundsDict[ticker] for ticker in ticker_list]
pd.DataFrame(holdindsFullList, columns=["Reference", "Column1", "Column2" ...])) # Several columns
Whole Brain
  • 2,097
  • 2
  • 8
  • 18