[{'complete': True, 'volume': 116, 'time': '2020-01-17T19:15:00.000000000Z', 'mid': {'o': '1.10916', 'h': '1.10917', 'l': '1.10906', 'c': '1.10912'}}, {'complete': True, 'volume': 136, 'time': '2020-01-17T19:30:00.000000000Z', 'mid': {'o': '1.10914', 'h': '1.10922', 'l': '1.10908', 'c': '1.10919'}}, {'complete': True, 'volume': 223, 'time': '2020-01-17T19:45:00.000000000Z', 'mid': {'o': '1.10920', 'h': '1.10946', 'l': '1.10920', 'c': '1.10930'}}, {'complete': True, 'volume': 203, 'time': '2020-01-17T20:00:00.000000000Z', 'mid': {'o': '1.10930', 'h': '1.10931', 'l': '1.10919', 'c': '1.10928'}}, {'complete': True, 'volume': 87, 'time': '2020-01-17T20:15:00.000000000Z', 'mid': {'o': '1.10926', 'h': '1.10934', 'l': '1.10922', 'c': '1.10926'}}, {'complete': True, 'volume': 102, 'time': '2020-01-17T20:30:00.000000000Z', 'mid': {'o': '1.10926', 'h': '1.10928', 'l': '1.10913', 'c': '1.10920'}}, {'complete': True, 'volume': 277, 'time': '2020-01-17T20:45:00.000000000Z', 'mid': {'o': '1.10918', 'h': '1.10929', 'l': '1.10913', 'c': '1.10928'}}, {'complete': True, 'volume': 103, 'time': '2020-01-17T21:00:00.000000000Z', 'mid': {'o': '1.10927', 'h': '1.10929', 'l': '1.10920', 'c': '1.10924'}}, {'complete': True, 'volume': 54, 'time': '2020-01-17T21:15:00.000000000Z', 'mid': {'o': '1.10926', 'h': '1.10926', 'l': '1.10910', 'c': '1.10912'}}, {'complete': False, 'volume': 15, 'time': '2020-01-17T21:30:00.000000000Z', 'mid': {'o': '1.10913', 'h': '1.10918', 'l': '1.10912', 'c': '1.10913'}}]

I want to extract the 'time' and 'mid' fields from this list. Inside 'mid' there is a nested dictionary with 'o', 'h', 'l', and 'c' keys. Is there a way to combine 'time' and these nested values into a DataFrame?


Stidgeon
Leo

3 Answers


Try:

df = pd.DataFrame(your_data)
df = pd.concat([df['time'], df['mid'].apply(pd.Series)], axis=1)
                             time        o        h        l        c
0  2020-01-17T19:15:00.000000000Z  1.10916  1.10917  1.10906  1.10912
1  2020-01-17T19:30:00.000000000Z  1.10914  1.10922  1.10908  1.10919
2  2020-01-17T19:45:00.000000000Z  1.10920  1.10946  1.10920  1.10930
3  2020-01-17T20:00:00.000000000Z  1.10930  1.10931  1.10919  1.10928
4  2020-01-17T20:15:00.000000000Z  1.10926  1.10934  1.10922  1.10926
5  2020-01-17T20:30:00.000000000Z  1.10926  1.10928  1.10913  1.10920
6  2020-01-17T20:45:00.000000000Z  1.10918  1.10929  1.10913  1.10928
7  2020-01-17T21:00:00.000000000Z  1.10927  1.10929  1.10920  1.10924
8  2020-01-17T21:15:00.000000000Z  1.10926  1.10926  1.10910  1.10912
9  2020-01-17T21:30:00.000000000Z  1.10913  1.10918  1.10912  1.10913
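As a side note, `pd.json_normalize` (available since pandas 0.25) can flatten the nested `mid` dict in one call; the nested keys come back prefixed as `mid.o`, `mid.h`, etc. A sketch with two sample records:

```python
import pandas as pd

your_data = [
    {'complete': True, 'volume': 116, 'time': '2020-01-17T19:15:00.000000000Z',
     'mid': {'o': '1.10916', 'h': '1.10917', 'l': '1.10906', 'c': '1.10912'}},
    {'complete': True, 'volume': 136, 'time': '2020-01-17T19:30:00.000000000Z',
     'mid': {'o': '1.10914', 'h': '1.10922', 'l': '1.10908', 'c': '1.10919'}},
]

# flatten the nested 'mid' dict into mid.o / mid.h / mid.l / mid.c columns
df = pd.json_normalize(your_data)

# keep only the columns of interest and drop the 'mid.' prefix
df = df[['time', 'mid.o', 'mid.h', 'mid.l', 'mid.c']]
df.columns = ['time', 'o', 'h', 'l', 'c']
```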
Kenan

Try the following:

import pandas as pd

l = [{'complete': True, 'volume': 116, 'time': '2020-01-17T19:15:00.000000000Z', 'mid': {'o': '1.10916', 'h': '1.10917', 'l': '1.10906', 'c': '1.10912'}}, {'complete': True, 'volume': 136, 'time': '2020-01-17T19:30:00.000000000Z', 'mid': {'o': '1.10914', 'h': '1.10922', 'l': '1.10908', 'c': '1.10919'}}, {'complete': True, 'volume': 223, 'time': '2020-01-17T19:45:00.000000000Z', 'mid': {'o': '1.10920', 'h': '1.10946', 'l': '1.10920', 'c': '1.10930'}}, {'complete': True, 'volume': 203, 'time': '2020-01-17T20:00:00.000000000Z', 'mid': {'o': '1.10930', 'h': '1.10931', 'l': '1.10919', 'c': '1.10928'}}, {'complete': True, 'volume': 87, 'time': '2020-01-17T20:15:00.000000000Z', 'mid': {'o': '1.10926', 'h': '1.10934', 'l': '1.10922', 'c': '1.10926'}}, {'complete': True, 'volume': 102, 'time': '2020-01-17T20:30:00.000000000Z', 'mid': {'o': '1.10926', 'h': '1.10928', 'l': '1.10913', 'c': '1.10920'}}, {'complete': True, 'volume': 277, 'time': '2020-01-17T20:45:00.000000000Z', 'mid': {'o': '1.10918', 'h': '1.10929', 'l': '1.10913', 'c': '1.10928'}}, {'complete': True, 'volume': 103, 'time': '2020-01-17T21:00:00.000000000Z', 'mid': {'o': '1.10927', 'h': '1.10929', 'l': '1.10920', 'c': '1.10924'}}, {'complete': True, 'volume': 54, 'time': '2020-01-17T21:15:00.000000000Z', 'mid': {'o': '1.10926', 'h': '1.10926', 'l': '1.10910', 'c': '1.10912'}}, {'complete': False, 'volume': 15, 'time': '2020-01-17T21:30:00.000000000Z', 'mid': {'o': '1.10913', 'h': '1.10918', 'l': '1.10912', 'c': '1.10913'}}]

df = pd.DataFrame()

for ll in l:
    df = df.append(pd.DataFrame(ll['mid'], index=[ll['time']]))
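Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; on current pandas the same loop can be expressed with `pd.concat`, building one single-row frame per record and concatenating once at the end (a sketch with two sample records):

```python
import pandas as pd

l = [
    {'time': '2020-01-17T19:15:00.000000000Z',
     'mid': {'o': '1.10916', 'h': '1.10917', 'l': '1.10906', 'c': '1.10912'}},
    {'time': '2020-01-17T19:30:00.000000000Z',
     'mid': {'o': '1.10914', 'h': '1.10922', 'l': '1.10908', 'c': '1.10919'}},
]

# one single-row DataFrame per record, concatenated in a single call
df = pd.concat(pd.DataFrame(ll['mid'], index=[ll['time']]) for ll in l)
```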
Daniel Lima

Assuming your sample data is named data:

>>> pd.DataFrame([d['mid'] for d in data], index=[d['time'] for d in data])
                                      o        h        l        c
2020-01-17T19:15:00.000000000Z  1.10916  1.10917  1.10906  1.10912
2020-01-17T19:30:00.000000000Z  1.10914  1.10922  1.10908  1.10919
2020-01-17T19:45:00.000000000Z  1.10920  1.10946  1.10920  1.10930
2020-01-17T20:00:00.000000000Z  1.10930  1.10931  1.10919  1.10928
2020-01-17T20:15:00.000000000Z  1.10926  1.10934  1.10922  1.10926
2020-01-17T20:30:00.000000000Z  1.10926  1.10928  1.10913  1.10920
2020-01-17T20:45:00.000000000Z  1.10918  1.10929  1.10913  1.10928
2020-01-17T21:00:00.000000000Z  1.10927  1.10929  1.10920  1.10924
2020-01-17T21:15:00.000000000Z  1.10926  1.10926  1.10910  1.10912
2020-01-17T21:30:00.000000000Z  1.10913  1.10918  1.10912  1.10913
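One caveat: the o/h/l/c values are still strings at this point, and the index is plain text. If you need numeric prices and a proper DatetimeIndex, a follow-up conversion might look like this (sketch with one sample record):

```python
import pandas as pd

data = [{'time': '2020-01-17T19:15:00.000000000Z',
         'mid': {'o': '1.10916', 'h': '1.10917', 'l': '1.10906', 'c': '1.10912'}}]

df = pd.DataFrame([d['mid'] for d in data], index=[d['time'] for d in data])
df = df.astype(float)                # o/h/l/c arrive as strings
df.index = pd.to_datetime(df.index)  # parse the ISO timestamps (tz-aware UTC)
```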

Timings

data *= 1000  # Now list of 10k dictionaries.

%timeit df = pd.DataFrame([d['mid'] for d in data], index=[d['time'] for d in data])
# 13.4 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df = pd.DataFrame(data)
df = pd.concat([df['time'], df['mid'].apply(pd.Series)], axis=1)
# 4.52 s ± 494 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df = pd.DataFrame()
for record in data:
    df = df.append(pd.DataFrame(record['mid'], index=[record['time']]))
# 21.4 s ± 2.86 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Alexander
  • I'm curious: can the for-loop approach become very inefficient for large data? – Kenan Jan 18 '20 at 00:29
  • No, not in this context. – Alexander Jan 18 '20 at 03:46
  • Really? Hmm... wouldn't you be reading `data` line by line at the high level? Also, the interpreter has to determine the dtype of `d` on each iteration. How does that get ignored in this context? – Kenan Jan 18 '20 at 04:32
  • As you can see from the timings, my approach beats the pants off the other two. I originally made `data` with 1 million rows where my approach took just over a second. I had to kill it when the faster of the other two methods still hadn't finished 20 minutes later. – Alexander Jan 18 '20 at 06:43
  • Alexander's approach has better performance; however, you need to load the full data set into memory before creating the dataframe. I believe memory is not a problem here, so unless you have a pretty limited system, this is the best option. – Daniel Lima Jan 18 '20 at 10:33
  • @DanielLima not sure what your point is here, as the data needs to be loaded for all of the answers. In fact, this solution is also the most efficient when it comes to memory use. – Alexander Jan 18 '20 at 17:11
  • I agree with you @Alexander, I just don't see how looping can be so fast, especially after reading [this](https://stackoverflow.com/questions/8097408/why-python-is-so-slow-for-a-simple-for-loop) – Kenan Jan 18 '20 at 19:15
  • @Kenan Perhaps this helps to explain loops vs comprehensions: https://stackoverflow.com/questions/22108488/are-list-comprehensions-and-functional-functions-faster-than-for-loops/22108640#22108640. In any case, Pandas needs to iterate over the data (basically a loop) to create the numpy arrays used as the data for the columns. My method is equivalent to `pd.DataFrame.from_records([d['mid'] for d in data])`. By contrast, creating a separate Series for each record creates a lot of overhead and is not very efficient. – Alexander Jan 19 '20 at 05:01