
I'm doing some web scraping and storing the variables of interest in the form:

a = {'b':[100, 200],'c':[300, 400]}

This is for one page, where there were two b's and two c's. The next page could have three of each, where I'd store them as:

b = {'b':[300, 400, 500],'c':[500, 600, 700]}

When I create a DataFrame from the list of dicts, I get:

import pandas as pd
df = pd.DataFrame([a, b])

df
                 b                c
0       [100, 200]       [300, 400]
1  [300, 400, 500]  [500, 600, 700]

What I'm expecting is:

df
     b    c
0  100  300
1  200  400
2  300  500
3  400  600
4  500  700

I could create a DataFrame each time I store a page and concat the list of DataFrames at the end. However, in my experience this is very expensive: constructing thousands of DataFrames costs much more than building one DataFrame from a lower-level structure (i.e., a list of dicts).

MaxU - stand with Ukraine
Ryan Erwin
  • Possible duplicate of [this question](http://stackoverflow.com/q/38577737/6525140)? Not exactly, but at least both questions are strongly related to each other. – Michael Hoff Jul 27 '16 at 22:37
  • Iterating over keys, and merging lists seems like the solution you'd like. – Bedi Egilmez Jul 27 '16 at 22:44
  • Do you need the `a` and `b` dicts for anything else? If not, you could just keep appending data to dict `a` as you receive it from all pages, then do `df = pd.DataFrame(a)`. – Matthias Fripp Jul 27 '16 at 22:53
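
The suggestion in the last comment, appending to a single dict as pages arrive, might look roughly like the sketch below. The `pages` list is a hypothetical stand-in for the per-page scrape results, and `collections.defaultdict` is used here only so that the first page needs no special handling; neither appears in the original question.

```python
from collections import defaultdict

# Hypothetical stand-in for the per-page scrape results from the question.
pages = [
    {'b': [100, 200], 'c': [300, 400]},
    {'b': [300, 400, 500], 'c': [500, 600, 700]},
]

# One growing list per column; unseen keys start out as empty lists.
columns = defaultdict(list)
for page in pages:
    for key, values in page.items():
        columns[key].extend(values)

print(dict(columns))
# {'b': [100, 200, 300, 400, 500], 'c': [300, 400, 500, 600, 700]}
```

Passing `columns` to `pd.DataFrame` once, after the last page, should then give the five-row frame from the question without building any intermediate DataFrames.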

3 Answers


Try this. I changed the keys for clarity:

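import pandas as pd

a = {'e': [100, 200], 'f': [300, 400]}
b = {'e': [300, 400, 500], 'f': [500, 600, 700]}
c = {'e': [300, 400, 500], 'f': [500, 600, 700]}

listDicts = [a, b, c]
dd = {}

for x in listDicts:
    for k in listDicts[0].keys():
        # Build a new list each time so the input dicts are never mutated;
        # dd.get() replaces the bare try/except around the first assignment.
        dd[k] = dd.get(k, []) + x[k]

df = pd.DataFrame(dd)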

     e    f
0  100  300
1  200  400
2  300  500
3  400  600
4  500  700
5  300  500
6  400  600
7  500  700
Merlin

Comprehensions FTW (maybe not the fastest, but can you get any more pythonic?):

import pandas as pd

list_of_dicts = [{'b': [100, 200], 'c': [300, 400]},
                 {'b': [300, 400, 500], 'c': [500, 600, 700]}]

def extract(key):
    return [item for x in list_of_dicts for item in x[key]]

df = pd.DataFrame({k: extract(k) for k in ['b', 'c']})
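
If the column names are not known up front, the hard-coded `['b', 'c']` could be replaced by keys collected from the data. The following is only a sketch of that idea; using `itertools.chain` and `dict.fromkeys` is my assumption, not part of the answer above, and keys missing from a page are treated as contributing no values.

```python
from itertools import chain

list_of_dicts = [{'b': [100, 200], 'c': [300, 400]},
                 {'b': [300, 400, 500], 'c': [500, 600, 700]}]

# Iterating a dict yields its keys; dict.fromkeys de-duplicates them
# while keeping first-seen order (guaranteed in Python 3.7+).
keys = list(dict.fromkeys(chain.from_iterable(list_of_dicts)))

# Flatten each key's per-page lists into one column.
columns = {
    k: list(chain.from_iterable(d.get(k, []) for d in list_of_dicts))
    for k in keys
}

print(columns)
# {'b': [100, 200, 300, 400, 500], 'c': [300, 400, 500, 600, 700]}
```

`pd.DataFrame(columns)` would then build the frame as before.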

EDIT:

I stand corrected. It is just as fast as some of the other approaches.

import pandas as pd
import toolz

list_of_dicts = [{'b': [100, 200], 'c': [300, 400]},
                 {'b': [300, 400, 500], 'c': [500, 600, 700]}]

def extract(key):
    return [item for x in list_of_dicts for item in x[key]]

def merge_dicts(trg, src):
    for k, v in src.items():
        trg[k].extend(v)

def approach_AlbertoGarciaRaboso():
    df = pd.DataFrame({k: extract(k) for k in ['b', 'c']})

def approach_root():
    df = pd.DataFrame(toolz.merge_with(lambda x: list(toolz.concat(x)), list_of_dicts))

def approach_Merlin():
    dd = {}
    for x in list_of_dicts:
        for k in list_of_dicts[0].keys():
            try:    dd[k] = dd[k] + x[k]
            except: dd[k] = x[k]
    df = pd.DataFrame(dd)

def approach_MichaelHoff():
    merge_dicts(list_of_dicts[0], list_of_dicts[1])
    df = pd.DataFrame(list_of_dicts[0])


%timeit approach_AlbertoGarciaRaboso()  # 1000 loops, best of 3: 501 µs per loop
%timeit approach_root()                 # 1000 loops, best of 3: 503 µs per loop
%timeit approach_Merlin()               # 1000 loops, best of 3: 516 µs per loop
%timeit approach_MichaelHoff()          # 100 loops, best of 3: 2.62 ms per loop
Alicia Garcia-Raboso
  • You cannot time my approach like this. My function modifies the given dictionary, so repeated timeit runs keep extending the same lists and end up measuring ever longer inputs. Another thing to consider is how the approaches perform on integer lists (and dicts) significantly longer than 2-3 elements. – Michael Hoff Apr 21 '17 at 10:28
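
One way to address the point in the comment above is to give every timed call a fresh copy of the input, so the in-place merge never sees lists grown by earlier runs. This is a rough sketch only: the helper name `approach_merge_fresh` is hypothetical, the measured time includes the cost of `copy.deepcopy`, and the DataFrame construction is left out, so the result is not directly comparable to the figures quoted in the answer.

```python
import copy
import timeit

list_of_dicts = [{'b': [100, 200], 'c': [300, 400]},
                 {'b': [300, 400, 500], 'c': [500, 600, 700]}]

def merge_dicts(trg, src):
    for k, v in src.items():
        trg[k].extend(v)

def approach_merge_fresh():
    # Deep-copy so each timed call starts from the original, unmodified lists.
    pages = copy.deepcopy(list_of_dicts)
    merged = pages[0]
    for page in pages[1:]:
        merge_dicts(merged, page)
    return merged

runs = 1000
elapsed = timeit.timeit(approach_merge_fresh, number=runs)
print(f"{elapsed / runs * 1e6:.1f} us per call (includes the deepcopy cost)")
```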

What about simply merging the dictionaries in each step?

import pandas as pd

def merge_dicts(trg, src):
    for k, v in src.items():
        trg[k].extend(v)

a = {'b':[100, 200],'c':[300, 400]}
b = {'b':[300, 400, 500],'c':[500, 600, 700]}

merge_dicts(a, b)

print(a)

# {'b': [100, 200, 300, 400, 500], 'c': [300, 400, 500, 600, 700]}

print(pd.DataFrame(a))

#     b    c
# 0  100  300
# 1  200  400
# 2  300  500
# 3  400  600
# 4  500  700
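
Two things can trip up the merge above on real scraped data: a page may contain a key the target dict has not seen yet, and the merged lists may end up with different lengths, in which case `pd.DataFrame` raises a `ValueError`. The sketch below is one possible guard; `merge_dicts_safe` and `check_equal_lengths` are hypothetical helpers, not part of the answer.

```python
def merge_dicts_safe(trg, src):
    """Like merge_dicts, but creates the list for keys trg has not seen yet."""
    for k, v in src.items():
        trg.setdefault(k, []).extend(v)

def check_equal_lengths(columns):
    """Raise ValueError early if the columns cannot form a rectangular table."""
    lengths = {k: len(v) for k, v in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns have different lengths: {lengths}")

merged = {}
for page in [{'b': [100, 200], 'c': [300, 400]},
             {'b': [300, 400, 500], 'c': [500, 600, 700]}]:
    merge_dicts_safe(merged, page)

check_equal_lengths(merged)  # passes: both columns have 5 values
print(merged)
# {'b': [100, 200, 300, 400, 500], 'c': [300, 400, 500, 600, 700]}
```

Starting from an empty dict also leaves the per-page dicts untouched, which matters if they are needed again later.
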
Michael Hoff