pandas: column formatting issues causing merge problems

Question

I have the following two databases:

url='https://raw.githubusercontent.com/108michael/ms_thesis/master/rgdp_catcode.merge'

df=pd.read_csv(url, index_col=0)
df.head(1)

    naics   catcode                                        GeoName  Description     ComponentName   year    GDP     state
0   22  E1600',\t'E1620',\t'A4000',\t'E5000',\t'E3000'...   Alabama     Utilities   Real GDP by state   2004    5205    AL

url='https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge'
df1=pd.read_csv(url, index_col=0)

df1.head(1)    
    state   year    unemployment    log_diff_unemployment   id.thomas   party   type    date    bills   id.fec  years_exp   session     name    disposition     catcode
0   AK  2006    6.6     -0.044452   1440    Republican  sen     2006-05-01  s2686-109   S2AK00010   39  109     National Cable & Telecommunications Association     support     C4500

Regarding df, I had to manually input the catcode values. I think that is why the formatting is off. What I would like is to simply have the values without the \t prefix. I want to merge the dfs on catcode, state, year. I made a test earlier wherein a df1.catcode with only one value per cell was matched with the values in another df.catcode that had more than one value per cell and it worked.

So technically, all I need to do is lose the \t before each consecutive value in df.catcode, but additionally, if anyone has ever done a merge of this sort before, any 'caveats' learned through experience would be appreciated. My merge code looks like this:

mplmerge=pd.merge(df1,df, on=(['catcode', 'state', 'year']), how='left' )

I think this can be done with the regex method, I'm looking at the documentation now.

score 1 · Accepted Answer · edited May 23 '17 at 12:31

Cleaning catcode column in df is rather straightforward:

catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}')

This will produce a series with a list of catcodes in every row:

catcode_fixed.head(3)
Out[195]: 
0    [E1600, E1620, A4000, E5000, E3000, E1000]
1           [X3000, X3200, L1400, H6000, X5000]
2           [X3000, X3200, L1400, H6000, X5000]
Name: catcode, dtype: object

If I understand correctly what you want, then you need to "ungroup" these lists. Here is the trick, in short:

catcode_fixed = catcode_fixed = catcode_fixed.apply(pd.Series).stack()
catcode_fixed.index = catcode_fixed.index.droplevel(-1)

So, we've got (note the index values):

catcode_fixed.head(12)
Out[206]: 
0    E1600
0    E1620
0    A4000
0    E5000
0    E3000
0    E1000
1    X3000
1    X3200
1    L1400
1    H6000
1    X5000
2    X3000
dtype: object

Now, dropping the old catcode and joining in the new one:

df.drop('catcode',axis = 1, inplace = True)
catcode_fixed.name = 'catcode'
df = df.join(catcode_fixed)

By the way, you may also need to use df1.reset_index() when merging the data frames.

pandas: column formatting issues causing merge problems

1 Answers1