2

I manage a PostgreSQL database and am working on a tool for users to access a subset of the database. The database has a number of columns, and in addition we use a huge number of hstore keys to store additional information specific to certain rows in the database. Basic example below

A          B        C        hstore   
"foo"      1        4        "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway" 
"bar"      4        6        "Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"
"foobar"   2        8
"baz"      3        1        "Fruit"=>"apple", "Name"=>"David"

The data is routinely exported to a CSV file like this:

COPY tableName TO '/filepath/file.csv' DELIMITER ',' CSV HEADER;

I read this into a Pandas dataframe in Python like this:

import pandas as pd
df = pd.read_csv('/filepath/file.csv')

I then access a subset of the data. This subset should have a common set of hstore keys in most, but not necessarily all rows.

I would like to create a separate column for each of the hstore keys. Where a key does not exist for a row, the cell should be left empty, or filled with a NULL or NAN value, whatever is easiest. What is the most effective way to do this?

lhmarsden
  • 177
  • 11

2 Answers2

1

You can use .str.extractall() to extract the keys and values from column hstore, then use .pivot() to transform the keys to column labels. Aggregate the entries for each row in original dataframe by .groupby() and .agg(). Set NaN for empty entries with .replace(). Finally, join back the result dataframe to original dataframe with .join():

df.join(df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
             .reset_index()
             .pivot(index=['level_0', 'match'], columns=0, values=1)
             .groupby(level=0)
             .agg(lambda x: ''.join(x.dropna()))
             .replace('', np.nan)
       )

Result:

          A  B  C                                               hstore   Country  Fruit   Name  Pet
0     "foo"  1  4  "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"    Norway  apple    NaN  dog
1     "bar"  4  6                  "Pet"=>"cat", "Country"=>"Suriname"  Suriname    NaN    NaN  cat
2  "foobar"  2  8                                                 None       NaN    NaN    NaN  NaN
3     "baz"  3  1                    "Fruit"=>"apple", "Name"=>"David"       NaN  apple  David  NaN

If you want to get a new dataframe for the extraction instead of joining back to the original dataframe, you can remove the .join() step and do a .reindex(), as follows:

df_out = (df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
             .reset_index()
             .pivot(index=['level_0', 'match'], columns=0, values=1)
             .groupby(level=0)
             .agg(lambda x: ''.join(x.dropna()))
             .replace('', np.nan)
         )
df_out = df_out.reindex(df.index)

Result:

print(df_out)


    Country  Fruit   Name  Pet
0    Norway  apple    NaN  dog
1  Suriname    NaN    NaN  cat
2       NaN    NaN    NaN  NaN
3       NaN  apple  David  NaN
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • Thanks for your quick and useful reply. Almost there, but there is a problem with `str.extractall(r'\"(\w+)\"=>\"(\w+)\"')`. This is only returning key and value pairs where the values are integers if there are keys with string values and different keys with integer values. I have updated my example. – lhmarsden May 31 '21 at 14:32
  • @lhmarsden I tested your updated sample with `Number`. It's working fine without problem. The key `Number` and other string only fields are extracted normally. This should be the case since regex \w+ extracts alphabets as well as numbers. \w is equivalent to [a-zA-Z0-9_]. And with all keys and values being string (numbers are also stings here), there should be no difference whether the values are alphabets or numbers (in string format). – SeaBean Jun 01 '21 at 07:48
  • @lhmarsden To simulate your error case, would you please dump your sample data using e.g. `df.to_dict('list')` and post it here ? I will take a look and see anything special. – SeaBean Jun 01 '21 at 07:51
  • @lhmarsden Checked your dataset and found there are 2 columns `other` and `metadata` having the layout similar to your `hstore` column in the question. With detailed inspection of the data, seems like the contents in keys and values pairs can contains character other than word characters, i.e. include symbols other than [a-zA-Z0-9_] alphabets, numbers and underscore. Therefore, my regex previously assumed you have only word characters cannot extract some of them. See my edit above now fine-tuned to support also the symbols. Please retry and let me know the result. – SeaBean Jun 02 '21 at 19:28
  • 1
    Excellent! You are correct, we have 2 columns that use hstore key value pairs, however for now I only want to separate one of them. Your solution above does the trick. Thanks very much for your help. – lhmarsden Jun 03 '21 at 06:34
  • @lhmarsden Great! Glad that the solution can help! – SeaBean Jun 03 '21 at 06:55
0

If you don't mind installing an additional library (sqlalchemy), as suggested here, you can do the following:

import pandas as pd
import sqlalchemy.dialects.postgresql as postgresql

hstore_to_dict = postgresql.HSTORE().result_processor(None, None)

df = pd.read_csv("/filepath/file.csv")
df["hstore"] = df["hstore"].fillna("")

hstore_dict = df["hstore"].map(hstore_to_dict)
hstore_df = pd.json_normalize(hstore_dict)

# optionally merge the expanded hstore dataframe with the original dataframe 
df = pd.concat([df, hstore_df], axis=1)
df.drop("hstore", axis=1, inplace=True)

pd.json_normalize converts a list of dictionaries to a dataframe, as answered here.

I assume your input CSV file looks like this:

A,B,C,hstore
bar,4,6,"""Pet""=>""cat"", ""Number""=>""5"", ""Country""=>""Suriname"""
foobar,2,8,
baz,3,1,"""Name""=>""David"", ""Fruit""=>""apple"""
foo,1,4,"""Pet""=>""dog"", ""Fruit""=>""apple"", ""Country""=>""Norway"""

After running the code, the output would be this dataframe:

        A  B  C  Pet Number   Country   Name  Fruit
0     bar  4  6  cat      5  Suriname    NaN    NaN
1  foobar  2  8  NaN    NaN       NaN    NaN    NaN
2     baz  3  1  NaN    NaN       NaN  David  apple
3     foo  1  4  dog    NaN    Norway    NaN  apple
F1refly
  • 336
  • 2
  • 7