3

Problem Statement

I have a list of tuples of dicts: [(A, B), (A, B),...]. I wrote A and B for the dictionaries because the keys are the same across these "types".

I want a dataframe with some keys from A and some keys from B.

Some of the keys in A are also present in B. I'd like to keep the keys from A.

Ways of approaching it:

I can think of a couple ways, and I'm curious which will be more performant. I've listed them in the order of my best guess as to performance:

  • A list comprehension, building new dictionaries (or extending A with parts of B) and then pd.DataFrame.from_records.

  • pd.DataFrame.from_records has an exclude parameter. Merge the larger dicts first and then exclude columns when building the dataframe.

  • Transpose the list of tuples (maybe zip(*)?), create two dataframes with .from_records, one for each A and B, remove unnecessary columns from each, and then glue the resulting dataframes together side by side.

  • Make each dict (row) a dataframe and then glue them on top of one another vertically (append or concat or something).

As a complete newbie to pandas, it seems to difficult to tell what each operation is, and when it's building a view or doing a copy, so I can't tell what is expensive and what isn't.

  • Am I missing an approach to this?

  • Are my solutions in the correct order of performance?

  • If instead of dictionaries, A and B were dataframes, would concatenating them be faster? How much memory overhead does a dataframe have, and is it ever common practice to have a one-row dataframe?

Specifics:

Here's some simplified example data,

[({"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "other": "drop this"}, 
  {"chrom": "chr1": "pStart": 1500, "pEnd": 2500, "drop": "this"}), 
 ({"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "other": "unimportant"}, 
  {"chrom": "chr2": "pStart": 7500, "pEnd": 9500, "drop": "me"}) ]

The result I'd like I think would be the outcome of:

 pd.DataFrame.from_records([ 
  {"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "pStart": 1500, "pEnd": 2500}, 
  {"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "pStart": 7500, "pEnd": 9500}  ] )

Pseudocode of the solution I'd like:

I think this would work if dictionaries had a nice, in-place select method:

A_fields = [...]
B_fields = [...]
A_B_merged = [a.select(A_fields).extend(b.select(B_fields)) for a, b in A_B_not_merged]

A_B_dataframe = pd.DataFrame.from_records(A_B_merged)
Alex Lenail
  • 12,992
  • 10
  • 47
  • 79

2 Answers2

1

You need to go down two levels of your input to get it processed. Your best friend is then chain.from_iterable:

import itertools as it
pd.DataFrame.from_records(it.chain.from_iterable(l))
Out[21]: 
  chrom    gEnd  gStart    pEnd  pStart
0  chr1  2000.0  1000.0     NaN     NaN
1  chr1     NaN     NaN  2500.0  1500.0
2  chr2  8500.0  8000.0     NaN     NaN
3  chr2     NaN     NaN  9500.0  7500.0

This requires classic and easy cleanup:

pd.DataFrame.from_records(it.chain.from_iterable(l)).set_index('chrom').stack().unstack()
Out[22]: 
         gEnd  gStart    pEnd  pStart
chrom                                
chr1   2000.0  1000.0  2500.0  1500.0
chr2   8500.0  8000.0  9500.0  7500.0
Zeugma
  • 31,231
  • 9
  • 69
  • 81
  • Hi @Boud, Thanks for your reply! 2 questions: What does pandas do when there are overlapping keys in `A` and `B`? And would you mind explaining a little more in your answer what `.set_index('chrom').stack().unstack()` is for? – Alex Lenail Feb 06 '17 at 15:23
  • Also, though my example data is oversimplified, there will be columns from `A` and columns from `B` which I'd like to drop. How should I be dropping them? – Alex Lenail Feb 06 '17 at 15:24
  • I'm concerned about `.set_index('chrom').stack().unstack()` since I have many rows which will have `chrom: 'chr1'`. I don't want to index by chrom. Nevertheless I like the `chain.from_iterable`... @Boud – Alex Lenail Feb 06 '17 at 18:25
0

Using plain old dictionary merge that merges the start and end dictionary via pythonic (Python 3.5+) way and then uses from_records to construct DataFrame.

pd.DataFrame.from_records([{**d[0],**d[1]} for d in k])

  chrom  gEnd  gStart  pEnd  pStart
0  chr1  2000    1000  2500    1500
1  chr2  8500    8000  9500    7500
kuriouscoder
  • 5,394
  • 7
  • 26
  • 40
  • I really like this syntax, but how performant is it? And what do I do if I want to drop most of the columns from both `A` and `B` (or `d[0]` and `d[1]` in your solution)? – Alex Lenail Feb 06 '17 at 17:58
  • I ended up using this as the basis of my solution. =) – Alex Lenail Feb 06 '17 at 19:22
  • Care to break down how `**d[0]` works? is it a way of telling the dict expansion operator to only pick out the key (0th element) of each dict tuple? – matanster Jul 13 '18 at 15:02