Problem Statement
I have a list of tuples of dicts: [(A, B), (A, B), ...]. I wrote A and B for the dictionaries because the keys are the same across these "types".
I want a dataframe with some keys from A and some keys from B.
Some of the keys in A are also present in B. I'd like to keep the keys from A.
Ways of approaching it:
I can think of a few ways, and I'm curious which will be most performant. I've listed them in order of my best guess as to performance:
1. A list comprehension, building new dictionaries (or extending A with parts of B) and then pd.DataFrame.from_records.
2. pd.DataFrame.from_records has an exclude parameter. Merge the larger dicts first and then exclude columns when building the dataframe.
3. Transpose the list of tuples (maybe zip(*)?), create two dataframes with .from_records, one for each of A and B, remove unnecessary columns from each, and then glue the resulting dataframes together side by side.
4. Make each dict (row) a dataframe and then glue them on top of one another vertically (append or concat or something).
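To make the transpose-and-glue idea concrete, here is a minimal sketch of it using the example data from the Specifics section. The variable names (pairs, df_a, df_b, side_by_side) and the column lists are assumptions chosen for illustration, and the tuples returned by zip are converted to lists before being handed to from_records.

```python
import pandas as pd

# Example (A, B) pairs, as in the Specifics section.
pairs = [
    ({"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "other": "drop this"},
     {"chrom": "chr1", "pStart": 1500, "pEnd": 2500, "drop": "this"}),
    ({"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "other": "unimportant"},
     {"chrom": "chr2", "pStart": 7500, "pEnd": 9500, "drop": "me"}),
]

# zip(*pairs) transposes the list of pairs into (all A dicts, all B dicts).
a_dicts, b_dicts = zip(*pairs)

# Build one dataframe per "type" and keep only the wanted columns.
# The column lists are assumptions based on the desired result.
df_a = pd.DataFrame.from_records(list(a_dicts))[["chrom", "gStart", "gEnd"]]
df_b = pd.DataFrame.from_records(list(b_dicts))[["pStart", "pEnd"]]

# Both frames carry the default 0..n-1 index, so axis=1 pairs up matching rows.
side_by_side = pd.concat([df_a, df_b], axis=1)
```

Since chrom is selected only from df_a, the value from A is the one that survives for the shared key.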
As a complete newbie to pandas, I find it difficult to tell what each operation does, and whether it's building a view or making a copy, so I can't tell what is expensive and what isn't.
Am I missing an approach to this?
Are my solutions in the correct order of performance?
If instead of dictionaries, A and B were dataframes, would concatenating them be faster? How much memory overhead does a dataframe have, and is it ever common practice to have a one-row dataframe?
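One way to check the ordering empirically is to time the variants on synthetic data. The following is a rough sketch that compares merging plain dicts first against building one single-row dataframe per pair. The function names, the field lists, and the data size are all assumptions made for illustration, and the timings will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Synthetic stand-in for the real data: 500 (A, B) pairs (size is arbitrary).
pairs = [
    ({"chrom": "chr1", "gStart": i, "gEnd": i + 10, "other": "x"},
     {"chrom": "chr1", "pStart": i + 5, "pEnd": i + 15, "drop": "y"})
    for i in range(500)
]
A_fields = ["chrom", "gStart", "gEnd"]  # assumed fields to keep from A
B_fields = ["pStart", "pEnd"]           # assumed fields to keep from B


def merge_pair(a, b):
    """Combine the selected fields; A is unpacked last so it wins on shared keys."""
    return {**{k: b[k] for k in B_fields}, **{k: a[k] for k in A_fields}}


def build_from_dicts(pairs):
    """Merge plain dicts first, then build a single dataframe."""
    return pd.DataFrame.from_records([merge_pair(a, b) for a, b in pairs])


def build_from_row_frames(pairs):
    """Build one single-row dataframe per pair, then stack them with concat."""
    frames = [pd.DataFrame.from_records([merge_pair(a, b)]) for a, b in pairs]
    return pd.concat(frames, ignore_index=True)


t_dicts = timeit.timeit(lambda: build_from_dicts(pairs), number=3)
t_frames = timeit.timeit(lambda: build_from_row_frames(pairs), number=3)
print(f"dicts first: {t_dicts:.3f}s, one-row frames: {t_frames:.3f}s")
```

Both functions produce the same rows, so any difference in the printed timings reflects the construction strategy rather than the result.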
Specifics:
Here's some simplified example data:
[({"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "other": "drop this"},
  {"chrom": "chr1", "pStart": 1500, "pEnd": 2500, "drop": "this"}),
 ({"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "other": "unimportant"},
  {"chrom": "chr2", "pStart": 7500, "pEnd": 9500, "drop": "me"})]
I think the result I'd like would be the outcome of:
pd.DataFrame.from_records([
{"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "pStart": 1500, "pEnd": 2500},
{"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "pStart": 7500, "pEnd": 9500} ] )
Pseudocode of the solution I'd like:
I think this would work if dictionaries had a nice, in-place select method:
A_fields = [...]
B_fields = [...]
A_B_merged = [a.select(A_fields).extend(b.select(B_fields)) for a, b in A_B_not_merged]
A_B_dataframe = pd.DataFrame.from_records(A_B_merged)
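
Dictionaries have no select or extend method, but a dict comprehension plus dict unpacking gets close to the pseudocode. The following is a minimal runnable sketch under some assumptions: select is a hypothetical helper written here, and the contents of A_fields and B_fields are guesses based on the desired result, since the pseudocode elides them.

```python
import pandas as pd

# Example data from the Specifics section.
A_B_not_merged = [
    ({"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "other": "drop this"},
     {"chrom": "chr1", "pStart": 1500, "pEnd": 2500, "drop": "this"}),
    ({"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "other": "unimportant"},
     {"chrom": "chr2", "pStart": 7500, "pEnd": 9500, "drop": "me"}),
]

# Assumed field lists; the real ones are elided in the pseudocode.
A_fields = ["chrom", "gStart", "gEnd"]
B_fields = ["pStart", "pEnd"]


def select(d, fields):
    """Hypothetical helper: return a new dict with only the requested keys."""
    return {k: d[k] for k in fields}


# B's fields are unpacked first so that A wins on any key present in both.
A_B_merged = [
    {**select(b, B_fields), **select(a, A_fields)}
    for a, b in A_B_not_merged
]

# Reorder the columns explicitly, since the merged dicts list B's keys first.
A_B_dataframe = pd.DataFrame.from_records(A_B_merged)[A_fields + B_fields]
```

Note that select builds a new dict instead of modifying a or b in place, so the original tuples are left untouched.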