
This is done in Python 2, and I am using pandas. While I would LOVE to do this in SQL, which would probably make my life much easier, that is not an option.

I have two dataframes, x and y. Dataframe x has a column 'BuyerNames' and y has 'Buyers', and both contain a column 'Section', among other shared columns. Entries in x's 'BuyerNames' and y's 'Buyers' are drawn from the same set, and data in shared columns is usually consistent. I want to create a dataframe that is a subset of y, call it y*, such that, for rows with corresponding indices, the value of 'BuyerNames' is equal to that of 'Buyers'. This dataframe is the same length as y, and rows not satisfying the condition contain a NaN entry.
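Since no code can be posted, here is a minimal sketch of that y* construction, assuming made-up column values and that x and y share an index. `DataFrame.where` keeps rows where the condition holds and fills the rest with NaN, preserving the length and index:

```python
import pandas as pd

# Hypothetical data matching the description: x and y share an index.
x = pd.DataFrame({'BuyerNames': ['ann', 'bob', 'cal'],
                  'Section': ['s1', 's2', 's3']})
y = pd.DataFrame({'Buyers': ['ann', 'zed', 'cal'],
                  'Section': ['s1', 's9', 's3']})

# Keep y's rows where the buyer names agree row-by-row; other rows
# become all-NaN, so the result has the same length and index as y.
y_star = y.where(x['BuyerNames'] == y['Buyers'])
print(y_star)
```

This is vectorized, so it avoids the slow Python-level for loop mentioned below.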

I am creating a third dataframe, z, which has the union of the columns of x and y. My final goal is to populate z with the entries of 'Section' contained in y*, preserving the index.

I am very new to databases, hence what is probably a very confusing description of something likely very simple. How do I achieve this efficiently? The current solution uses a for loop in Python, which is very slow and not scalable. I cannot post code, sadly.

1 Answer


You are correct that this is a problem made for a short SQLite query:

CREATE TABLE IF NOT EXISTS z AS 
SELECT y.section, x.buyernames AS buyernames_id, y.buyers AS buyers_id
FROM x JOIN y ON x.buyernames = y.buyers;

...but it's certainly possible in pandas as well. The easiest method, in my mind, would be to generate two separate dataframes, one for each table. Then use pandas' `concat` function to combine the two dataframes. The documentation has a pretty good description of the options.
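As a quick illustration (with invented data, since none was provided), `concat` with `axis=1` stitches two frames side by side, aligned on their shared index, which gives a frame with the union of both sets of columns:

```python
import pandas as pd

# Hypothetical frames standing in for x and y.
a = pd.DataFrame({'BuyerNames': ['ann', 'bob'], 'Section': ['s1', 's2']})
b = pd.DataFrame({'Buyers': ['ann', 'bob'], 'Price': [10, 20]})

# axis=1 concatenates column-wise, aligning rows on the index;
# the result holds the union of both frames' columns.
z = pd.concat([a, b], axis=1)
print(z.columns.tolist())  # ['BuyerNames', 'Section', 'Buyers', 'Price']
```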

I can sort of picture your tables, but I'm not completely clear on the structure. It'd be best if you looked through the examples, found which one matches your problem, and I could help you further. Perhaps `concat` isn't the best approach; there is also a `merge`/`join` command.

As for the final step, I'm unsure which API you plan on using for the new database, so it's hard to give specific guidance on how to populate a new table. Pandas can send SQL, and `DataFrame.to_sql` can write a dataframe out as a new table. SQLAlchemy would also work well, but it has a learning curve if you're unfamiliar with ORMs or databases in general.
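For instance, a sketch of writing a dataframe to a fresh SQLite table (the frame and table name here are made up; `to_sql` accepts a plain `sqlite3` connection and creates the table if it does not exist):

```python
import sqlite3
import pandas as pd

# Hypothetical frame standing in for z.
df = pd.DataFrame({'buyernames': ['ann', 'bob'],
                   'section': ['s1', 's2']})

conn = sqlite3.connect(':memory:')
# to_sql creates the table 'z' and inserts the rows.
df.to_sql('z', conn, index=False)

rows = conn.execute('SELECT buyernames, section FROM z').fetchall()
print(rows)  # [('ann', 's1'), ('bob', 's2')]
```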

EDIT: Since you're unable to provide data, here's a made-up example of a merge. This should be very fast; not quite as fast as raw SQL, but certainly faster than a for loop. First, the dummy data:

left = pd.DataFrame({'buyernames': ['john', 'frank', 'joe', 'june'],
                     'section': ['s1', 's2', 's1', 's2'],
                     'col3': ['C0', 'C1', 'C2', 'C3'],
                     'col4': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])

right = pd.DataFrame({'buyers': ['june', 'john', 'frank', 'joe'],
                      'section': ['s1', 's2', 's1', 's2'],
                      'col3': ['C4', 'C5', 'C6', 'C7'],
                      'col4': ['D4', 'D5', 'D6', 'D7']},
                     index=[4, 5, 6, 7])

and the pandas commands:

result = pd.merge(left, right, on='section')
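Since the question actually keys the match on the buyer names rather than on 'section', here is a hedged variant on the same dummy data: `left_on`/`right_on` join the differently named key columns, and `how='left'` keeps every row of `left`, filling NaN where no buyer matches (mirroring the y* requirement). Because both frames also share a 'section' column, pandas suffixes the duplicates:

```python
import pandas as pd

left = pd.DataFrame({'buyernames': ['john', 'frank', 'joe', 'june'],
                     'section': ['s1', 's2', 's1', 's2']})
right = pd.DataFrame({'buyers': ['june', 'john', 'frank', 'joe'],
                      'section': ['s1', 's2', 's1', 's2']})

# Join on the buyer columns instead of 'section'; how='left' keeps
# every row of `left` and fills NaN where no buyer matches.
result = pd.merge(left, right, left_on='buyernames', right_on='buyers',
                  how='left', suffixes=('_x', '_y'))
print(result[['buyernames', 'section_y']])
```

Here 'section_y' carries the 'Section' values from the right-hand frame, index-aligned with the left-hand rows, which matches the final goal of populating z.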
  • Very sorry, every time I said databases I meant dataframes, it seems. There IS a solution available, but it manually sorts through elements à la a for loop. The main challenge is to find a FAST way, likely by using a built-in command. – Yaniv Proselkov Jan 09 '19 at 16:36