
I have a data set that looks like this:

[image: sample of the data set, showing hpi by year, state, county, and tract]

I want to find the three-year appreciation in hpi. Note that hpi is at the tract level and the years range from 2012 to 2018.

The data set has all the states and counties and is much larger than the one I just demonstrated. I thought of using some type of groupby lambda function, such as the one below, which I used when I wanted to fill missing hpi values with the median by year, state, and county:

import numpy as np

medians = (all_data.groupby(['Year', 'state', 'County_name'])['hpi']
                   .transform(lambda x: x.median() if x.notnull().any() else np.nan))
all_data['hpi'] = all_data['hpi'].fillna(medians)

But I was not able to adapt the code above for this purpose. Any suggestions are greatly appreciated.

justanewb

2 Answers


I've added an additional county to your data and made up a fictional HPI index for Barbour County:

import pandas as pd

state = ["Alabama"] * 12
county = ["Baldin County"] * 6 + ["Barbour County"] * 6
year = [x for y in range(2) for x in range(2012, 2018)]
hpi = [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114]
data = {"Year": year, "State": state, "County": county, "HPI": hpi}

df = pd.DataFrame(data)

# Sorting is necessary.
df = df.sort_values(['State', 'County', 'Year'])
print(df)

    Year    State          County    HPI
0   2012  Alabama   Baldin County  125.0
1   2013  Alabama   Baldin County  130.0
2   2014  Alabama   Baldin County  127.5
3   2015  Alabama   Baldin County  142.0
4   2016  Alabama   Baldin County  160.0
5   2017  Alabama   Baldin County  139.0
6   2012  Alabama  Barbour County   98.0
7   2013  Alabama  Barbour County  108.0
8   2014  Alabama  Barbour County  102.0
9   2015  Alabama  Barbour County  115.0
10  2016  Alabama  Barbour County  118.0
11  2017  Alabama  Barbour County  114.0

From this base, we shift 'HPI' by three rows and divide to get the appreciation values you are looking for.

df["3 year appreciation"] = df["HPI"] / df["HPI"].shift(3)
print(df)

    Year    State          County    HPI  3 year appreciation
0   2012  Alabama   Baldin County  125.0                  NaN
1   2013  Alabama   Baldin County  130.0                  NaN
2   2014  Alabama   Baldin County  127.5                  NaN
3   2015  Alabama   Baldin County  142.0             1.136000
4   2016  Alabama   Baldin County  160.0             1.230769
5   2017  Alabama   Baldin County  139.0             1.090196
6   2012  Alabama  Barbour County   98.0             0.690141
7   2013  Alabama  Barbour County  108.0             0.675000
8   2014  Alabama  Barbour County  102.0             0.733813
9   2015  Alabama  Barbour County  115.0             1.173469
10  2016  Alabama  Barbour County  118.0             1.092593
11  2017  Alabama  Barbour County  114.0             1.117647

However, now you have NaNs at the start of the first county, and the first three years of every subsequent county hold incorrect values because the shift crosses the county boundary. To correct this, we group by State/County, retrieve the first three rows of each group with head(3), take their index values, and use those to set the corresponding entries to zero.

df.loc[df.groupby(["State", "County"]).head(3).index, "3 year appreciation"] = 0
print(df)

    Year    State          County    HPI  3 year appreciation
0   2012  Alabama   Baldin County  125.0             0.000000
1   2013  Alabama   Baldin County  130.0             0.000000
2   2014  Alabama   Baldin County  127.5             0.000000
3   2015  Alabama   Baldin County  142.0             1.136000
4   2016  Alabama   Baldin County  160.0             1.230769
5   2017  Alabama   Baldin County  139.0             1.090196
6   2012  Alabama  Barbour County   98.0             0.000000
7   2013  Alabama  Barbour County  108.0             0.000000
8   2014  Alabama  Barbour County  102.0             0.000000
9   2015  Alabama  Barbour County  115.0             1.173469
10  2016  Alabama  Barbour County  118.0             1.092593
11  2017  Alabama  Barbour County  114.0             1.117647

Total code is:

import pandas as pd

state = ["Alabama"] * 12
county = ["Baldin County"] * 6 + ["Barbour County"] * 6
year = [x for y in range(2) for x in range(2012, 2018)]
hpi = [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114]
data = {"Year": year, "State": state, "County": county, "HPI": hpi}

df = pd.DataFrame(data)
df = df.sort_values(['State', 'County', 'Year'])

df["3 year appreciation"] = df["HPI"] / df["HPI"].shift(3)

df.loc[df.groupby(["State", "County"]).head(3).index, "3 year appreciation"] = 0
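As a variant (not part of the original answer), you can do the shift *within* each State/County group instead of shifting globally and zeroing afterwards. This is a sketch using the same column names as above; because the shift never crosses a group boundary, the first three years of every county come out as NaN automatically and can be filled in one step:

```python
import pandas as pd

state = ["Alabama"] * 12
county = ["Baldin County"] * 6 + ["Barbour County"] * 6
year = [x for y in range(2) for x in range(2012, 2018)]
hpi = [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114]
df = pd.DataFrame({"Year": year, "State": state, "County": county, "HPI": hpi})
df = df.sort_values(["State", "County", "Year"])

# shift(3) inside each group: rows never cross a county boundary,
# so the first three years of each county are NaN rather than wrong.
df["3 year appreciation"] = df["HPI"] / df.groupby(["State", "County"])["HPI"].shift(3)
df["3 year appreciation"] = df["3 year appreciation"].fillna(0)
```

This also behaves correctly if a county happens to have fewer than three years of data, which the global-shift approach would silently get wrong.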
run-out
  • So when we sort, the other columns won't be misaligned, correct? – justanewb Jan 30 '20 at 18:39
  • Correct, I'm sorting by the order in the brackets, which is necessary for the solution. State first, county second, then date, and the HPI's will follow along. Sorting this way allows us to shift() and know we are getting the correct values. – run-out Jan 30 '20 at 18:40
  • Of course, if you are missing a year, you will have a problem. – run-out Jan 30 '20 at 18:41
  • No years will be missing, but when I apply the sort part I don't get the data in the format we need it in. I still have trailing 2012 years instead of 2012, 2013, etc. – justanewb Jan 30 '20 at 18:46
  • I think based on the dataset I am working with your solution will not work. Which is why I thought a lambda function is necessary. – justanewb Jan 30 '20 at 18:52
  • Do you have one HPI value for each ['Year', 'State', 'County'] combination? Only use my code starting at the sort and check your headers. I think my column names are slightly different. – run-out Jan 30 '20 at 18:54
  • Yes but the hpi is also for each tract which is much more granular. If you want to see a subsample of the entire dataset I would be happy to send it to you. – justanewb Jan 30 '20 at 18:56
  • OK, well that's different than your original question. If you have more data and granularity, then you will need a different approach I presume. I was working with what you gave me. Can you put an actual sample data into your question and state how you would like to deal with the granularity of data? – run-out Jan 30 '20 at 18:58
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/206937/discussion-between-run-out-and-justanewb). – run-out Jan 30 '20 at 18:59
  • Any chance you would help me I am still stuck – justanewb Jan 30 '20 at 19:55

Here is a minimal working example:

import pandas as pd

# create data
data = {"Year": [2010, 2011, 2012, 2013, 2014]*2,
        "State": ["Bama", "Bama", "Bama", "Bama", "Bama",
                  "NY", "NY", "NY", "NY", "NY"],
        "hpi": [100, 105, 110, 115, 120]*2}
data = pd.DataFrame.from_dict(data)

# Create column with 3y shifted hpi
data["hpi_3y"] = data.groupby(["State"])["hpi"].shift(3)
# compute your appreciation value from the columns
data["3y_appreciation"] = 100 + ((data["hpi"] / data["hpi_3y"] - 1) * 100)
data

Basically, you group by all relevant columns (not including the year). Then, you shift the values in the hpi column by 3 rows = 3 years. After that, the corresponding hpi and hpi_3y sit in the same row per observation and you can just do the computation.

Output:

|   Year | State   |   hpi |   hpi_3y |   3y_appreciation |
|-------:|:--------|------:|---------:|------------------:|
|   2010 | Bama    |   100 |      nan |           nan     |
|   2011 | Bama    |   105 |      nan |           nan     |
|   2012 | Bama    |   110 |      nan |           nan     |
|   2013 | Bama    |   115 |      100 |           115     |
|   2014 | Bama    |   120 |      105 |           114.286 |
|   2010 | NY      |   100 |      nan |           nan     |
|   2011 | NY      |   105 |      nan |           nan     |
|   2012 | NY      |   110 |      nan |           nan     |
|   2013 | NY      |   115 |      100 |           115     |
|   2014 | NY      |   120 |      105 |           114.286 |
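To adapt this to more granular data like the tract level discussed in the comments, it should be enough to add the extra identifying columns to the groupby. A sketch under the assumption of a hypothetical "Tract" column (not in the original answer):

```python
import pandas as pd

# hypothetical tract-level data: two tracts within one state
data = {"Year": [2012, 2013, 2014, 2015] * 2,
        "State": ["Bama"] * 8,
        "Tract": ["T1"] * 4 + ["T2"] * 4,
        "hpi": [100, 110, 120, 130, 200, 210, 220, 230]}
df = pd.DataFrame(data)

# group by every identifying column except Year, then shift 3 rows = 3 years
df["hpi_3y"] = df.groupby(["State", "Tract"])["hpi"].shift(3)
df["3y_appreciation"] = 100 + ((df["hpi"] / df["hpi_3y"] - 1) * 100)
```

Each tract gets its own shift, so the first three years of every tract are NaN and no values bleed between tracts.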
mc51
  • Why don't we include the year in the groupby? – justanewb Jan 30 '20 at 19:23
  • Because groupby gives you the unique combination of the columns you specify. Hence, you would get a single row for each Year and State in my example. This is not what you want. You want the unique state and all the years so that you can compute `hpi_3y` in a sensible way. – mc51 Jan 30 '20 at 19:29
  • Did you see my latest update on what the data looks like? The reason I ask is because my data is much more granular than the example I presented. – justanewb Jan 30 '20 at 19:47
  • Well, your data is really hard to read. Instead, you should give us a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). Also, you could output your pandas data-frame in a better readable format like [this](https://stackoverflow.com/questions/33181846/programmatically-convert-pandas-dataframe-to-markdown-table). – mc51 Jan 30 '20 at 19:50
  • Moreover, if you understand my simple example properly, you should be able to adapt it to more complex cases. It should be enough in most cases to just add the appropriate columns to the `groupby` function. – mc51 Jan 30 '20 at 19:52
  • If I gave you the entire dataset, could you show me how to do it? – justanewb Jan 30 '20 at 20:01
  • Could you accept the answer as correct if you agree it meets your original request? Thanks – run-out Jan 31 '20 at 02:40