
Here is a simple example of the code I am running, and I would like the results put into a pandas dataframe (unless there is a better option):

for p in game.players.passing():
    print p, p.team, p.passing_att, p.passer_rating()

R.Wilson SEA 29 55.7
J.Ryan SEA 1 158.3
A.Rodgers GB 34 55.8

Using this code:

d = []
for p in game.players.passing():
    d = [{'Player': p, 'Team': p.team, 'Passer Rating':
        p.passer_rating()}]

pd.DataFrame(d)

I can get:

    Passer Rating   Player      Team
  0 55.8            A.Rodgers   GB

Which is a 1x3 dataframe, and I understand why it is only one row but I can't figure out how to make it multi-row with the columns in the correct order. Ideally the solution would be able to deal with n number of rows (based on p) and it would be wonderful (although not essential) if the number of columns would be set by the number of stats requested. Any suggestions? Thanks in advance!

c.j.mcdonn
  • You're overwriting your list with each iteration, not appending – Paul H Jan 20 '15 at 23:03
  • Right, I understand what is wrong with it, the problem is I can't figure out how to make it work correctly. This is just the closest I could get. – c.j.mcdonn Jan 21 '15 at 01:24
  • The answer below will work. You could also just do `d.append({'Player': ...})` in your loop. The Python docs on lists are pretty good. – Paul H Jan 21 '15 at 01:26
  • You should also clarify your question to state the real issue: that you're having trouble appending to an empty list. (You seem to understand how to create dataframes from lists of dictionaries very well.) – Paul H Jan 21 '15 at 01:30
  • While I think I understand what you are saying, I believe the question I asked is actually what I would _prefer_, while the code I posted was the closest I could get before asking for help. – c.j.mcdonn Jan 21 '15 at 01:55
  • Most numeric operations with pandas can be vectorized, which means they are much faster than conventional iteration. OTOH, some operations (such as string and regex) are inherently hard to vectorize. In this case, it is important to understand _how_ to loop over your data. For more information on when and how looping over your data is to be done, please read [For loops with Pandas - When should I care?](https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care/54028200#54028200). – cs95 Jan 04 '19 at 10:17

4 Answers


The simplest answer is what Paul H said:

d = []
for p in game.players.passing():
    d.append(
        {
            'Player': p,
            'Team': p.team,
            'Passer Rating':  p.passer_rating()
        }
    )

pd.DataFrame(d)
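
On the column-order part of the question: a minimal sketch, assuming the rows are collected as a list of dicts as above. `FakePlayer` is a hypothetical stand-in for whatever `game.players.passing()` yields (it is not part of any real library), and the numbers are the ones from the question's output. Passing `columns=` to `pd.DataFrame` sets the column order explicitly.

```python
import pandas as pd


class FakePlayer:
    """Hypothetical stand-in for a player object from game.players.passing()."""

    def __init__(self, name, team, passing_att, rating):
        self.name = name
        self.team = team
        self.passing_att = passing_att
        self._rating = rating

    def passer_rating(self):
        return self._rating

    def __str__(self):
        return self.name


# Sample values copied from the question's printed output
players = [
    FakePlayer("R.Wilson", "SEA", 29, 55.7),
    FakePlayer("J.Ryan", "SEA", 1, 158.3),
    FakePlayer("A.Rodgers", "GB", 34, 55.8),
]

rows = []
for p in players:
    # One dict per row; append so earlier rows are kept
    rows.append({"Player": str(p), "Team": p.team, "Passer Rating": p.passer_rating()})

# columns= fixes the column order regardless of the dict key order
df_ordered = pd.DataFrame(rows, columns=["Player", "Team", "Passer Rating"])
print(df_ordered)
```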

But if you really want to "build and fill a dataframe from a loop", (which, btw, I wouldn't recommend), here's how you'd do it.

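d = pd.DataFrame()

for p in game.players.passing():
    # Wrap the dict in a list so pandas builds a one-row frame; a dict of
    # bare scalars raises "If using all scalar values, you must pass an index".
    temp = pd.DataFrame(
        [
            {
                'Player': p,
                'Team': p.team,
                'Passer Rating': p.passer_rating()
            }
        ]
    )

    # ignore_index=True renumbers the rows instead of repeating index 0
    d = pd.concat([d, temp], ignore_index=True)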
inc42
Nick Marinakis
  • Is it preferable to append a dict to the list and create the `df` only at the end due to superior performance, or just better readability? – ryantuck Aug 18 '15 at 14:16
  • Performance. To quote the [docs](http://pandas.pydata.org/pandas-docs/version/0.16.2/merging.html#concatenating-objects): ...`concat` (and therefore `append`) makes a full copy of the data, and ... constantly reusing this function can create a significant performance hit. – Nick Marinakis Aug 20 '15 at 07:11
  • @NickMarinakis: I don't understand your comment: `if you really want to "build and fill a dataframe from a loop", (which, btw, I wouldn't recommend)`. Then how else can you build the dataframe if not via a loop? – stackoverflowuser2010 Aug 03 '17 at 21:06
  • @stackoverflowuser2010: So my comment means that you shouldn't create a dataframe and then loop over your data to fill it. Every time you use `pd.concat` you're making a full copy of the data. It's wildly inefficient. Instead, just create a different data structure (e.g. a list of dicts) and then convert that to a dataframe all at once. – Nick Marinakis Aug 04 '17 at 00:02
  • @NickMarinakis: Ok. In the first part of your answer you're still using a loop (to build up a `list` of `dict` one row at a time) and then converting the whole thing at once to a DataFrame. In the second (worse) solution, you're appending (via `concat`) one DataFrame row at a time. Understood. – stackoverflowuser2010 Aug 04 '17 at 00:20

Try this, using a generator expression (a list comprehension without the brackets):

import pandas as pd

df = pd.DataFrame(
    [p, p.team, p.passing_att, p.passer_rating()] for p in game.players.passing()
)
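
For the "number of columns set by the number of stats requested" part of the question, one possible sketch is to drive the columns from a list of stat names. Everything named here is an assumption for illustration: `FakePlayer` stands in for the real player objects, `stat_value` is a hypothetical helper, and the sample values are the ones printed in the question.

```python
import pandas as pd


class FakePlayer:
    """Hypothetical stand-in for a player object from game.players.passing()."""

    def __init__(self, name, team, passing_att, rating):
        self.name = name
        self.team = team
        self.passing_att = passing_att
        self._rating = rating

    def passer_rating(self):
        return self._rating


def stat_value(player, stat):
    """Look up a stat by name, calling it first if it is a method."""
    value = getattr(player, stat)
    return value() if callable(value) else value


players = [
    FakePlayer("R.Wilson", "SEA", 29, 55.7),
    FakePlayer("J.Ryan", "SEA", 1, 158.3),
    FakePlayer("A.Rodgers", "GB", 34, 55.8),
]

# Add or remove names here; the frame gets one column per requested stat
stats = ["team", "passing_att", "passer_rating"]

df_stats = pd.DataFrame(
    [[p.name] + [stat_value(p, s) for s in stats] for p in players],
    columns=["player"] + stats,
)
print(df_stats)
```

The row count follows the number of players and the column count follows `len(stats) + 1`, so neither has to be known in advance.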
cs95
Amit
  • Out of the box this gets me the closest to what I was looking for with the columns in the correct order, but I don't know enough about either python or pandas to say if it is the _best_ answer. Thanks for the help everyone. – c.j.mcdonn Jan 21 '15 at 01:53
  • What is `df` here? – Cai Jun 25 '18 at 11:05
  • @Cai Pandas dataframe – Amit Jun 25 '18 at 11:10
  • @Amit As in `df = pandas.DataFrame()`? Or as in `from pandas import DataFrame as df`? – Cai Jun 25 '18 at 11:57
  • @Cai - You can create dataframe both ways. In this case the latter is used. – Amit Jun 25 '18 at 12:02
  • @Amit Ok, then in that case should the solution be `d = df([p, p.team, p.passing_att, p.passer_rating()] for p in game.players.passing())`? (I.e. so `df` is called rather than indexed?) – Cai Jun 25 '18 at 12:03
  • @Amit can you please add more context to your solution? It is not obvious for now what `df` actually is. – dkolmakov Nov 13 '18 at 11:18
  • I have ended up here numerous times without finding what helped me. Here is the solution that I found [here](https://thispointer.com/python-pandas-how-to-add-rows-in-a-dataframe-using-dataframe-append-loc-iloc/): `df = pd.DataFrame(columns = headers); df = df.append(pd.Series(mylist, index=df.columns), ignore_index=True)` – vsm May 20 '20 at 03:10
  • Essentially, 1) create a DataFrame with the column headers, 2) create a Series with index = column headers of the DataFrame, 3) append the Series to the accruing DataFrame with ignore_index=True – vsm May 20 '20 at 03:17

Make a list of tuples with your data and then create a DataFrame with it:

d = []
for p in game.players.passing():
    d.append((p, p.team, p.passer_rating()))

pd.DataFrame(d, columns=('Player', 'Team', 'Passer Rating'))

A list of tuples should have less overhead than a list of dictionaries. I tested this below, but please remember to prioritize ease of code understanding over performance in most cases.

Testing functions:

def with_tuples(loop_size=1e5):
    res = []

    for x in range(int(loop_size)):
        res.append((x-1, x, x+1))

    return pd.DataFrame(res, columns=("a", "b", "c"))

def with_dict(loop_size=1e5):
    res = []

    for x in range(int(loop_size)):
        res.append({"a":x-1, "b":x, "c":x+1})

    return pd.DataFrame(res)

Results:

%timeit -n 10 with_tuples()
# 10 loops, best of 3: 55.2 ms per loop

%timeit -n 10 with_dict()
# 10 loops, best of 3: 130 ms per loop
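
Outside IPython, a comparable check can be sketched with the standard-library `timeit` module instead of the `%timeit` magic. The two functions are the ones defined above; the variable names `same`, `t_tuples` and `t_dict` are only for illustration, and the timings will differ by machine and pandas version, so none are shown.

```python
import timeit

import pandas as pd


def with_tuples(loop_size=1e5):
    res = []
    for x in range(int(loop_size)):
        res.append((x - 1, x, x + 1))
    return pd.DataFrame(res, columns=("a", "b", "c"))


def with_dict(loop_size=1e5):
    res = []
    for x in range(int(loop_size)):
        res.append({"a": x - 1, "b": x, "c": x + 1})
    return pd.DataFrame(res)


# Sanity check first: both builders should produce the same frame
same = with_tuples(1000).equals(with_dict(1000))
print("identical frames:", same)

# Total seconds for 3 calls of each builder
t_tuples = timeit.timeit(with_tuples, number=3)
t_dict = timeit.timeit(with_dict, number=3)
print(f"tuples: {t_tuples:.3f} s, dicts: {t_dict:.3f} s")
```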
Seanny123
  • I tried this in my code and it works amazingly well with the tuple. I'm just wondering: tuples are immutable, so how are we able to append them? – Sumit Pokhrel Mar 03 '20 at 22:09
  • @SumitPokhrel Tuples are immutable, but they aren't being mutated by the `append`. The list is being appended to and is thus what is being mutated. – Seanny123 Mar 03 '20 at 22:10
  • Don't you think appending something is mutating or changing it from its original form? If the list is being mutated by append, then why isn't the tuple being mutated by append? – Sumit Pokhrel Mar 03 '20 at 22:14
  • @SumitPokhrel because you append tuples to the list: `res=[(1,2)]` first, and then `res.append((3,4))` gives `[(1,2),(3,4)]`. So the tuples are not mutated. – Fee Jun 27 '21 at 13:46
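
The point made in these comments can be checked directly with plain Python; this is only a small illustration, and the variable names are arbitrary.

```python
res = [(1, 2)]
first = res[0]  # keep a reference to the original tuple

# append changes the list: it now holds one more element
res.append((3, 4))
print(res)

# The tuple already in the list is the very same, unchanged object
unchanged = res[0] is first and first == (1, 2)
print(unchanged)

# Tuples have no append method at all, so they cannot be grown in place
tuple_has_append = hasattr(first, "append")
print(tuple_has_append)
```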

I may be wrong, but I think the accepted answer by @amit has a bug.

from pandas import DataFrame as df
x = [1,2,3]
y = [7,8,9,10]

# this gives me a syntax error at 'for' (Python 3.7), so it is commented out here
# d1 = df[[a, "A", b, "B"] for a in x for b in y]

# this works
d2 = df([a, "A", b, "B"] for a in x for b in y)

# and if you want to add the column names on the fly
# note the additional parentheses
d3 = df(([a, "A", b, "B"] for a in x for b in y), columns = ("l","m","n","o"))
bzip2