I'm working through the Udacity ML course. After df_final.join(df_temp, how="left") I get NaN values, but in the course's venv everything works fine. What might be the problem?

P.S.: I also tried df_temp.index = pd.to_datetime(df_temp.index, utc=True) for each ticker, but it seems to have no effect.

Here we load data.

import yfinance as yf

tickets = ["AAPL", "AMD", "GOOG", "GLD"]

def download_tickets(tickets):
    # Download the full price history for each symbol and save it as CSV.
    for ticket in tickets:
        df = yf.Ticker(ticket).history(period="max")
        df.to_csv(symbol_to_path(ticket))

Here we build the path to the CSV file from a symbol.

import os

def symbol_to_path(symbol, base_dir="data"):
    # Create the data directory on first use, then return the path to <symbol>.csv.
    if not os.path.exists(base_dir):
        os.mkdir(base_dir)
    return os.path.join(base_dir, "{}.csv".format(str(symbol)))

Here we join data.

    # Create empty df with specified dates. 
    start_date = "2022-01-01"
    end_date = "2023-01-01"
    dates = pd.date_range(start_date, end_date)
    df_final = pd.DataFrame(index=dates)
    df_final.index = pd.to_datetime(df_final.index, utc=True)
    
    # Combine all with df_final
    for symbol in tickets:
        file_path = symbol_to_path(symbol)
        df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date",
                              usecols=["Date", "Close"], na_values=["nan"])
        df_temp = df_temp.rename(columns={"Close": symbol})
        df_final = df_final.join(df_temp, how="left")
        print(df_temp.head())
        print(df_final.head())

    return df_final

Output:

As you can see, the float values become NaN with the left join.

With the right join we get data, but not for the range 2022-01-01 to 2023-01-01.

Inner join

Outer join

Thank you.

UPD: Data after 2021

Blob
  • No one but you can see the data you use, so how can we tell you why the result is `NaN`? – Luuk Jan 14 '23 at 15:54
  • Perhaps you want `how="inner"`? It joins only on the intersecting index values. – Artyom Akselrod Jan 14 '23 at 16:13
  • @Luuk, I have attached a photo of the output. Is it not displayed? – Blob Jan 14 '23 at 17:30
  • @Artyom Akselrod, I tried left, right, inner, outer, cross. Do I need to attach all results? – Blob Jan 14 '23 at 17:32
  • You can read all those values, [check for NaN values](https://stackoverflow.com/questions/944700/how-can-i-check-for-nan-values), and assign whatever you want before trying to print them. – Luuk Jan 14 '23 at 18:01
  • @Luuk, as you can see, there are 2 prints: first for df_temp and second for df_final. On the first output we see float values like 0.99874, so there are no NaN values in df_temp. Or did I miss the point? – Blob Jan 14 '23 at 19:28
  • A [mre] is needed. It should include the input data (at least for one symbol with which the problem can be reproduced). – Luuk Jan 15 '23 at 09:34
  • @Luuk, thank you for your patience. How can I do this? I've already called head() for each symbol. The data is in CSV files, and the site does not allow uploading them :( Anyway, if it helps, I will edit the question with the full code to show how to get them. – Blob Jan 15 '23 at 11:57
  • It seems that you do not have history for your tickets after 2004. Can you show us the data frame for a single ticket, especially for 2022? – Artyom Akselrod Jan 16 '23 at 10:50
  • @ArtyomAkselrod, sure, here you are! Check UPD. – Blob Jan 16 '23 at 17:30

1 Answer

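The problem is the time zones. The ticket data is stamped at -05:00 (New York, I assume), while you generate df_final in UTC (+00:00). The timestamps in the two indexes therefore never coincide, so when you join, pandas finds no matching index values and fills the column with NaN.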
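The mismatch can be reproduced without downloading anything. The following is a minimal sketch, assuming made-up prices and a fixed -05:00 offset in place of the real CSV data; the names `eastern_offset`, `prices` and `calendar` are illustrative only.

```python
from datetime import timedelta, timezone

import pandas as pd

# Fixed -05:00 offset, standing in for the offset seen in the ticker CSVs.
eastern_offset = timezone(timedelta(hours=-5))

# Hypothetical prices stamped at local midnight, i.e. 05:00 UTC.
prices = pd.DataFrame(
    {"AAPL": [100.0, 101.0, 102.0]},
    index=pd.date_range("2022-01-03", periods=3, tz=eastern_offset),
)

# Calendar stamped at midnight UTC, as in the question.
calendar = pd.DataFrame(index=pd.date_range("2022-01-03", periods=3, tz="UTC"))

# No instant appears in both indexes, so the left join has nothing to match.
common = calendar.index.intersection(prices.index)
joined = calendar.join(prices, how="left")

print(len(common))                  # number of shared timestamps
print(joined["AAPL"].isna().all())  # whether every joined value is NaN
```

Midnight at -05:00 is 05:00 UTC, which is why converting df_temp to UTC alone does not help: the values are compared as instants, and 05:00 UTC never equals 00:00 UTC.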

The simplest solution for me was to give df_final the matching time zone (tz), i.e. generate the dates with the correct tz:

# Create empty df with specified dates. 
start_date = "2022-01-01"
end_date = "2023-01-01"
dates = pd.date_range(start_date, end_date, tz='-05:00') # change here
df_final = pd.DataFrame(index=dates)
#     df_final.index = pd.to_datetime(df_final.index, utc=True) # NOT needed anymore
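One caveat: if the CSV timestamps follow New York local time, the offset is presumably -04:00 during daylight saving time, and a fixed -05:00 calendar would then miss the summer rows. A possible variant, sketched here under that assumption, strips the time zone from the ticker index and joins on plain calendar dates. The CSV text and prices below are placeholders, not real data.

```python
import io

import pandas as pd

# Placeholder CSV with mixed offsets around the March 2022 DST change.
csv_text = """Date,Close
2022-03-11 00:00:00-05:00,154.73
2022-03-14 00:00:00-04:00,150.62
"""

df_temp = pd.read_csv(io.StringIO(csv_text), index_col="Date")

# Parse everything as UTC (this handles mixed offsets), then view it in New York time.
local_index = pd.to_datetime(df_temp.index, utc=True).tz_convert("America/New_York")

# Drop the time zone and the time of day, leaving naive calendar dates.
df_temp.index = local_index.tz_localize(None).normalize()
df_temp = df_temp.rename(columns={"Close": "AAPL"})

# A naive daily calendar now lines up with the ticker dates.
dates = pd.date_range("2022-03-11", "2022-03-14")
df_final = pd.DataFrame(index=dates).join(df_temp, how="left")
print(df_final)
```

With this approach the weekend rows stay NaN because there was no trading, while the trading days on both sides of the DST change pick up their prices.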
 
Artyom Akselrod