2

I'm reading tables from the web using pd.read_html and using a for loop in pandas to create a single dataframe from multiple dataframes. I'm able to successfully create a data frame for any given year but I'm missing the correct logic in the for loop to:

(1) Read data, (2) create a dataframe (3) Go to the next year and (4) Append that dataframe to previous dataframe.

The ideal outcome should be 1 dataframe with ~500 rows and 13 columns (for 2 years worth of data). Thanks!

for x in range(2017, 2019):  
    dfs = pd.read_html('https://www.treasury.gov/resource-center/data-chart-center/interest-rates/pages/TextView.aspx?data=yieldYear&year=(%d)' % x, skiprows=1) 
    for df in dfs:
        print df
DateList = ['Date', '1 mo', '2 mo', '3 mo', '6 mo', '1 yr', '2 yr', '3 yr', '5 yr', '7 yr', '10 yr', '20 yr', '30 yr']
df.columns = DateList
Mark
  • 25
  • 1
  • 1
  • 3

2 Answers2

3

You probably want something like this:

myDataFrame = pd.DataFrame()
for x in range(2017, 2019):  
    dfs = pd.read_html('https://www.treasury.gov/resource-center/data-chart-center/interest-rates/pages/TextView.aspx?data=yieldYear&year=(%d)' % x, skiprows=1) 
    for df in dfs:
        print df
        myDataFrame = myDataFrame.append(df)
DateList = ['Date', '1 mo', '2 mo', '3 mo', '6 mo', '1 yr', '2 yr', '3 yr', '5 yr', '7 yr', '10 yr', '20 yr', '30 yr']
df.columns = DateList

Edit: See @Parfait's comment. Apparently you shouldn't use DataFrame.append in a loop as it may cause problems with memory.

markos
  • 71
  • 1
  • 6
  • 6
    [Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying](https://stackoverflow.com/a/36489724/1422451). – Parfait Dec 26 '18 at 00:35
3

Consider building a list of data frames, then concatenate items once outside loop. Specifically, below uses a list comprehension that also assigns columns in each iteration, followed by a pd.concat call.

url = 'https://www.treasury.gov/resource-center/data-chart-center/interest-rates/' + \
      'pages/TextView.aspx?data=yieldYear&year=({yr})'

DateList = ['Date', '1 mo', '2 mo', '3 mo', '6 mo', '1 yr', '2 yr',
            '3 yr', '5 yr', '7 yr', '10 yr', '20 yr', '30 yr']

dfs = [(pd.read_html(url.format(yr=x), skiprows=1)[1]
          .set_axis(DateList, axis='columns', inplace=False)) for x in range(2017, 2019)]

final_df = pd.concat(dfs, ignore_index=True)

print(final_df.head())
#        Date  1 mo  2 mo  3 mo  6 mo  ...    5 yr  7 yr  10 yr  20 yr  30 yr
# 0  01/03/17  0.52   NaN  0.53  0.65  ...    1.94  2.26   2.45   2.78   3.04
# 1  01/04/17  0.49   NaN  0.53  0.63  ...    1.94  2.26   2.46   2.78   3.05
# 2  01/05/17  0.51   NaN  0.52  0.62  ...    1.86  2.18   2.37   2.69   2.96
# 3  01/06/17  0.50   NaN  0.53  0.61  ...    1.92  2.23   2.42   2.73   3.00
# 4  01/09/17  0.50   NaN  0.50  0.60  ...    1.89  2.18   2.38   2.69   2.97
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • thanks for your response. I tried the above but I received a syntax error saying it didn't like the square brackets. I think made them parenthesis and it returned an error: " 'list' object has no attribute 'set_axis'". I then tried to break up the read_html part into two steps: 'dfs = [(pd.read_html(url.format(yr=x), skiprows=1) for x in range(2017, 2019))] dfs_final = pd.DataFrame(data=dfs, columns = 'DateList')' – Mark Dec 26 '18 at 13:29
  • See edit. Apparently, your needed data table is the second html table on web page, hence we need to index `read_html` with `[1]`. Also, `set_axis` does work with *inplace=False`. – Parfait Dec 26 '18 at 14:34
  • AWESOME! Thank you so much. I learned a ton in this short little exercise. Really appreciate the input. – Mark Dec 26 '18 at 21:58
  • Great to hear. Glad to help! Cheers! – Parfait Dec 26 '18 at 22:06