0

I need to merge some data with some generated by a coworker. This reminds me of a JOIN in an SQL database, but we are not using databases, just an Excel file or .csv file for a couple of dozen entries, with a few columns each.

Is there a Python library I can use to view these data structures in an ad-hoc in-memory database and use an OUTER JOIN to merge?

example of my data:

Atomic symbol   Atomic number
H               1
He              2
Be              4
Si              14
Fe              26
U               92
Pu              94

his data:

Atomic symbol   Name       Hazard
H               Hydrogen   ignition, combustion
Be              Beryllium  dust is toxic
As              Arsenic    toxic
Pu              Plutonium  dust is toxic
Jason S
  • 184,598
  • 164
  • 608
  • 970

2 Answers2

2

If you have Pandas, using DataFrame.merge is the most convenient method:

import pandas as pd
from io import StringIO

my_data = '''\
Atomic symbol   Atomic number
H               1
He              2
Be              4
Si              14
Fe              26
U               92
Pu              94'''

his_data = '''\
Atomic symbol   Name       Hazard
H               Hydrogen   ignition, combustion
Be              Beryllium  dust is toxic
As              Arsenic    toxic
Pu              Plutonium  dust is toxic'''

my_df = pd.read_csv(StringIO(my_data), sep='\s{2,}')
his_df = pd.read_csv(StringIO(his_data), sep='\s{2,}')
joined_df = pd.merge(my_df, his_df, on=['Atomic symbol'], how='outer')
print(joined_df)

yields

  Atomic symbol  Atomic number       Name                Hazard
0             H            1.0   Hydrogen  ignition, combustion
1            He            2.0        NaN                   NaN
2            Be            4.0  Beryllium         dust is toxic
3            Si           14.0        NaN                   NaN
4            Fe           26.0        NaN                   NaN
5             U           92.0        NaN                   NaN
6            Pu           94.0  Plutonium         dust is toxic
7            As            NaN    Arsenic                 toxic

Or you could use sqlite3, which is part of Python's standard library. sqlite does not currently support FULL OUTER JOINs, however, so you would have to construct the OUTER JOIN yourself by using LEFT JOINs and UNIONs:

import sqlite3
import csv
from io import StringIO

my_data = '''\
"Atomic symbol","Atomic number"
"H","1"
"He","2"
"Be","4"
"Si","14"
"Fe","26"
"U","92"
"Pu","94"'''

his_data = '''\
"Atomic symbol","Name","Hazard"
"H","Hydrogen","ignition, combustion"
"Be","Beryllium","dust is toxic"
"As","Arsenic","toxic"
"Pu","Plutonium","dust is toxic"'''


with sqlite3.connect(':memory:') as conn:
    cursor = conn.cursor()
    sql = '''CREATE TABLE my_data
             (my_data_id INTEGER PRIMARY KEY AUTOINCREMENT,
              Atomic_symbol TEXT,
              Atomic_number INTEGER)'''
    cursor.execute(sql)

    my_data = csv.reader(StringIO(my_data), delimiter=',', quotechar='"')
    next(my_data)
    sql = '''INSERT INTO my_data (Atomic_symbol, Atomic_number) VALUES (?, ?)'''
    cursor.executemany(sql, my_data)

    sql = '''CREATE TABLE his_data
             (his_data_id INTEGER PRIMARY KEY AUTOINCREMENT,
              Atomic_symbol TEXT,
              Name TEXT,
              Hazard TEXT)'''
    cursor.execute(sql)
    his_data = csv.reader(StringIO(his_data), delimiter=',', quotechar='"')
    next(his_data)
    sql = '''INSERT INTO his_data (Atomic_symbol, Name, Hazard) VALUES (?, ?, ?)'''    
    cursor.executemany(sql, his_data)

    sql = '''SELECT m.Atomic_symbol, m.Atomic_number, h.Name, h.Hazard 
             FROM my_data m
             LEFT JOIN his_data h
             USING (Atomic_symbol)
             UNION ALL
             SELECT h.Atomic_symbol, m.Atomic_number, h.Name, h.Hazard 
             FROM his_data h
             LEFT JOIN my_data m
             USING (Atomic_symbol)
             WHERE m.Atomic_symbol is NULL'''
    cursor.execute(sql)
    result = cursor.fetchall()
    print('\n'.join([' '.join(map('{:10}'.format, map(str, row))) for row in result]))

yields

H          1          Hydrogen   ignition, combustion
He         2          None       None      
Be         4          Beryllium  dust is toxic
Si         14         None       None      
Fe         26         None       None      
U          92         None       None      
Pu         94         Plutonium  dust is toxic
As         None       Arsenic    toxic     
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
1

You may load it as in-memory SQL, or use pandas.

Assume we have two CSV files as you defined above:

/tmp/x.csv

"Atomic symbol","Atomic number"
"H",1
"He",2
"Be",4
"Si",14
"Fe",26
"U",92
"Pu",94

and /tmp/y.csv

"Atomic symbol","Name","Hazard"
"H","Hydrogen","ignition, combustion"
"Be","Beryllium","dust is toxic"
"As","Arsenic","toxic"
"Pu","Plutonium","dust is toxic"

Pandas:

import pandas as pd
pd.set_option('display.max_columns', 100)

x = pd.read_csv('/tmp/x.csv')
y = pd.read_csv('/tmp/y.csv')
result = pd.merge(x, y, on=['Atomic symbol'], how='outer')

print(x)
print(y)
print(result)

 Atomic symbol  Atomic number
0             H              1
1            He              2
...

  Atomic symbol       Name                Hazard
0             H   Hydrogen  ignition, combustion
1            Be  Beryllium         dust is toxic
2            As    Arsenic                 toxic
...

  Atomic symbol  Atomic number       Name                Hazard
0             H            1.0   Hydrogen  ignition, combustion
1            He            2.0        NaN                   NaN
2            Be            4.0  Beryllium         dust is toxic
...

In-memory sql:

import csv, sqlite3

connection = sqlite3.connect(":memory:")

def load_into_table(con, table_name, file_name):
    with open(file_name) as f:
        dr = csv.DictReader(f)

        fields = ', '.join(['`{}`'.format(f) for f in dr.fieldnames])
        values = ', '.join(['?' for _ in dr.fieldnames])

        query = "CREATE TABLE {table_name} ({fields});".format(table_name=table_name, fields=fields)

        con.execute(query)

        to_db = [list(i.values()) for i in dr]

        insert_query = "INSERT INTO {table_name} VALUES ({values});".format(table_name=table_name, fields=fields, values=values)

        con.executemany(insert_query, to_db)
        con.commit()

load_into_table(con=connection, table_name='x', file_name='/tmp/x.csv')
load_into_table(con=connection, table_name='y', file_name='/tmp/y.csv')

print(connection.execute('SELECT * FROM x').fetchall())
print(connection.execute('SELECT * FROM y').fetchall())
print(connection.execute('SELECT * FROM x LEFT JOIN y ON x.`Atomic symbol` = y.`Atomic symbol`; ').fetchall())

[('H', '1'), ('He', '2'), ('Be', '4'), ...]
[('H', 'Hydrogen', 'ignition, combustion'), ('Be', 'Beryllium', 'dust is toxic'), ...]
[('H', '1', 'H', 'Hydrogen', 'ignition, combustion'), ('He', '2', None, None, None), ...]

Note: SQLite did not support outer join. You may emulate it: http://www.sqlitetutorial.net/sqlite-full-outer-join/

Adam Puza
  • 1,504
  • 13
  • 9