-1

I am trying to build a dataframe where the data is grabbed from multiple files. I have created an empty dataframe with the desired shape, but I am having trouble grabbing the data. I found this but when I concat, I am still getting NaN values. Edit2: I changed the order of df creation and put concat inside the for loop and same result. (for obvious reasons)

import pandas as pd
import os
import glob

def daily_country_framer():
    # create assignments
    country_source = r"C:\Users\USER\PycharmProjects\Corona Stats\Country Series"
    list_of_files = glob.glob(country_source + r"\*.csv")
    latest_file = max(list_of_files, key=os.path.getctime)
    last_frame = pd.read_csv(latest_file)
    date_list = []
    label_list = []

    # build date_list values
    for file in os.listdir(country_source):
        file = file.replace('.csv', '')
        date_list.append(file)

    # build country_list values
    for country in last_frame['Country']:
        label_list.append(country)

    # create dataframe for each file in folder
    for filename in os.listdir(country_source):
        filepath = os.path.join(country_source, filename)
        if not os.path.isfile(filepath):
            continue
        df1 = pd.read_csv(filepath)
    df = pd.DataFrame(index=label_list, columns=date_list)
    df1 = pd.concat([df])
    print(df1)


daily_country_framer()

Two sample dataframes: (notice the different shapes)

                Country  Confirmed  Deaths  Recovered
0                 World    1595350   95455     353975
1           Afghanistan        484      15         32
2               Albania        409      23        165
3               Algeria       1666     235        347
4               Andorra        583      25         58
..                  ...        ...     ...        ...
180             Vietnam        255       0        128
181  West Bank and Gaza        263       1         44
182      Western Sahara          4       0          0
183              Zambia         39       1         24
184            Zimbabwe         11       3          0

[185 rows x 4 columns]
                Country  Confirmed  Deaths  Recovered
0                 World    1691719  102525     376096
1           Afghanistan        521      15         32
2               Albania        416      23        182
3               Algeria       1761     256        405
4               Andorra        601      26         71
..                  ...        ...     ...        ...
181  West Bank and Gaza        267       2         45
182      Western Sahara          4       0          0
183               Yemen          1       0          0
184              Zambia         40       2         25
185            Zimbabwe         13       3          0

[186 rows x 4 columns]

Current output:

                   01-22-2020 01-23-2020  ... 04-09-2020 04-10-2020
World                     NaN        NaN  ...        NaN        NaN
Afghanistan               NaN        NaN  ...        NaN        NaN
Albania                   NaN        NaN  ...        NaN        NaN
Algeria                   NaN        NaN  ...        NaN        NaN
Andorra                   NaN        NaN  ...        NaN        NaN
...                       ...        ...  ...        ...        ...
West Bank and Gaza        NaN        NaN  ...        NaN        NaN
Western Sahara            NaN        NaN  ...        NaN        NaN
Yemen                     NaN        NaN  ...        NaN        NaN
Zambia                    NaN        NaN  ...        NaN        NaN
Zimbabwe                  NaN        NaN  ...        NaN        NaN

[186 rows x 80 columns]

Desired output: (where NaN equals corresponding values from target column or a list of all columns ie: if ['Confirmed'] then 0,1,2,3,4, if all then [0,0,0],[1,0,0],[2,0,0])

Luck Box
  • 90
  • 1
  • 13
  • 1
    Via an outer merge perhaps (on country) ? IMHO, you'll have an easier time treating your end df as a timeseries (meaning pivoting the one in your current output), with date index as datetime type – DrGorilla.eth Apr 11 '20 at 21:18
  • I tried `df = pd.DataFrame(index=label_list, columns=date_list)` and it came back with `KeyError: 'Country'`. Is my error resulting from setting `index=['Country']`? Trying `print(type(df['Country']))` comes back with the same error. – Luck Box Apr 11 '20 at 21:48
  • Oh wait, what are you trying to achieve exactly ? Can you provide an example of the desired output (since you have for each day/file 4 columns by country and end up with only one daily value per country). Btw, something that might help you navigate through mergin in Python : https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html – DrGorilla.eth Apr 11 '20 at 23:03
  • I am trying to find the corresponding NaN values for each country for each date. Each value would be either a target column or all column values as a list which then can be indexed later on. `World - 5 - 12 - 22 - 31; Country_1 - 0 - 0 - 1 - 1` if targeting `['Confirmed']` – Luck Box Apr 11 '20 at 23:34

1 Answers1

1

Your code (with comments inline):

import pandas as pd
import os
import glob

def daily_country_framer():
    # create assignments
    country_source = r"C:\Users\USER\PycharmProjects\Corona Stats\Country Series"
    list_of_files = glob.glob(country_source + r"\*.csv")
    latest_file = max(list_of_files, key=os.path.getctime)
    last_frame = pd.read_csv(latest_file)
    date_list = []
    label_list = []

    # build date_list values
    for file in os.listdir(country_source):
        file = file.replace('.csv', '')
        date_list.append(file)

    # build country_list values
    for country in last_frame['Country']: # == last_frame['Country'].tolist()
        label_list.append(country)

    # create dataframe for each file in folder
    for filename in os.listdir(country_source):
        filepath = os.path.join(country_source, filename)
        if not os.path.isfile(filepath):
            continue
        df1 = pd.read_csv(filepath)
        # you redefine df1 for every file in the loop. So if there
        # are 10 files, only the last one is actually used anywhere
        # outside this loop.
    df = pd.DataFrame(index=label_list, columns=date_list)
    df1 = pd.concat([df])
    # here you just redefined df1 again as the concatenation of the
    # empty dataframe you just created in the line above.
    print(df1)


daily_country_framer()

So hopefully that illuminates why you were getting the results you were getting. It was doing exactly what you asked it to do.

What you want to do is get a dictionary with dates as keys and the associated dataframe as values, then concatenate that. This can be quite expensive because of some quirks with how pandas does concatenation, but if you concatenate along axis=0, you should be fine.

A better way might be the following:

import pandas as pd
import os


def daily_country_framer(country_source):
    accumulator = {}
    # build date_list values
    for filename in os.listdir(country_source):
        date = filename.replace('.csv', '')
        filepath = os.path.join(country_source, filename)
        accumulator[date] = pd.read_csv(filepath)
    # now we have a dictionary of {date : data} -- perfect!
    df = pd.concat(accumulator)
    return df


daily_country_framer("C:\Users\USER\PycharmProjects\Corona Stats\Country Series")

Does that work?

Aaron
  • 801
  • 6
  • 13