0

I have a dataframe with a single column that contains States followed by their particular regions. The states are already in alphabetical order. I want to convert this single column into two columns.

import pandas as pd
places = [{'State':'Alabama'},{'State':'Auburn'},{'State':'Florence'},{'State':'Troy'},{'State':'Alaska'},{'State':'Fairbanks}]
df0 = pd.DataFrame(places)

s1 = df0.loc[df['State']=='Alabama'].index[0] #return row of state 'Alabama'
s2 = df0.loc[df['State']=='Alaska'].index[0]  #return row of state 'Alaska'
df1 = df0                                     # create new dataframe

for index, row in df0.iterrows():             # parse single column into two columns data
    if index == s1:
        df1['State']='Alabama'
    elif index <= s2:
        df1['Region']=row['State']
df

The above code returns:

     State    Region
0    Alabama   Alabama
1    Alabama   Alabama
...  ...       ...
500  Alabama   Alabama

Note: In my example I only use Alabama and Alaska but there are another 500 rows below this for the other states. Any idea how I can get to the following:

     State     Region
0    Alabama   Auburn
1    Alabama   Florence
2    Alabama   Troy
3    Alaska    Fairbanks
...  ...       ...
Shaido
  • 27,497
  • 23
  • 70
  • 73
Riskinit
  • 97
  • 2
  • 8

1 Answers1

1

First you will need to have all the data for states: It will be good if you download it and call the states. I will manually try to call it over here by the first few lines

#This is not necessary. Download the file and call states directly from it.
import re
import pandas as pd
a = 'http://code.activestate.com/recipes/577305-python-dictionary-of-us-states-and-territories/download/1/'
states = eval(re.sub("(?<=\w')\s+(?=[^:])",",","".join(['{']+list(pd.read_csv(a).index))))

Just run this code as it is and do print(states) see whether you have all the states in your IDE You can also directly copy paste it as states from here

Now that we have all the states we can solve the problem

import numpy as np

places = [{'State':'Alabama'},{'State':'Auburn'},{'State':'Florence'},{'State':'Troy'},{'State':'Alaska'},{'State':'Fairbanks'}]

df0 = pd.DataFrame(places)
df0['Regions'] = df0.State

B = pd.DataFrame(df0.set_index('Regions').State.replace(regex=f'^(?!({"|".join(states.values())})).*',value=np.nan).reset_index()).ffill()
B[B.State!=B.Regions].iloc[:,[1,0]]

     State    Regions
1  Alabama     Auburn
2  Alabama   Florence
3  Alabama       Troy
5   Alaska  Fairbanks
Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • Thanks, this answer works. If you have time could you (1) explain what the replace statement is doing? (2) Why I need to use reset_index()? And when you call df.State are you just calling the column directly? So df['State'] == df.State? – Riskinit Aug 10 '18 at 10:55
  • @Riskinit 1. is`df['State'] == df.State`? yes. read more [here](https://stackoverflow.com/questions/41130255/in-a-pandas-dataframe-whats-the-difference-between-using-squared-brackets-or-d). 2 `reset_index()`. Technically it depends on how you solve the question. You will realize that I did set the `Regions` to be the indices, so I am undoing the process. I did not want to concatenate. You can decide not to set and reset. – Onyambu Aug 10 '18 at 13:27
  • @Riskinit 3.replace is just selecting all those that are not states and replacing them with `np.nan`. I am using `regular expressions`. So I first join all the states as `Alabama|Alaska|.....` then claim `^(?!...).*` which is a negative forward reference. It was easy in this case since the states were at the beginning of each sentence. You could have actually used `f'^((?!({"|".join(states.values())})).)*$'` which is generalized. you can read more [here](https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word) – Onyambu Aug 10 '18 at 13:39