
I am curious why the default way to pass data to a dataframe is column-based, rather than row-based. For example, something like:

d = {'name': ['John', 'Peter'], 'age': [10, 20]}
df = pd.DataFrame(data=d)
   age   name
0   10   John
1   20  Peter

Instead of something like:

[
  {'name': 'John', 'age': 10}, 
  {'name': 'Peter', 'age': 20}
]

Each row would be an observation in statistical terms, so I would think the intuitive or 'correct' way to enter data would be row-based (similar to how data is inserted in SQL, or usually represented in XML or JSON). The only explanation I can think of is historical: perhaps the basic unit of entry was a typed array/vector, and the language (or a library it depended on) lacked something like a (mixed-type) tuple or struct. But this is just a guess, and I'd like to understand why this is actually the case.

David542
  • Can you clarify exactly what you're seeking? The question as it is now is ambiguous: you stated "*I am curious why the default way to pass data to a dataframe is column-based, rather than row-based*". That's just incorrect, period. You can pass records as rows and the exact same dataframe will be created as if you had passed a dictionary of columns. Now this question has attracted a lot of answers, most of which contain incorrect facts or don't address the question. – mozway Aug 28 '23 at 04:02
  • For instance, Yilmaz's claim that you would need to iterate to increment values if the dataframe was created from rows is nonsense. Of course you can always use vectorized code whether you passed a dictionary or a list as input. Pandas doesn't "remember" what the input was; it converts everything to NumPy arrays internally. Nikhil's comment on the Zen also seems off-topic... It's absurd to reimplement the DataFrame constructor to do what was already done, but worse... In summary, your question as currently asked only has one answer: "*there is no column default to construct a DataFrame*" – mozway Aug 28 '23 at 04:08
  • @mozway I agree, let me write up some improvements to the question... – David542 Aug 29 '23 at 18:32

5 Answers


There are many ways to pass data to the DataFrame constructor, and your second example is already valid input:

pd.DataFrame([
  {'name': 'John', 'age': 10}, 
  {'name': 'Peter', 'age': 20}
])

Output:

    name  age
0   John   10
1  Peter   20

Pandas recognizes the format of the data automatically: a dictionary of lists represents columns, while a list of dictionaries represents rows. A dictionary of dictionaries is also accepted, with the outer keys read as columns and the inner keys as the index.
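For instance, all three input shapes just mentioned build the same frame (a quick sanity check, assuming a modern pandas where dict insertion order is preserved; the index labels come from the inner keys in the nested-dict case):

```python
import pandas as pd

# dict of lists: keys are columns
from_columns = pd.DataFrame({'name': ['John', 'Peter'], 'age': [10, 20]})

# list of dicts: each dict is one row
from_rows = pd.DataFrame([{'name': 'John', 'age': 10},
                          {'name': 'Peter', 'age': 20}])

# dict of dicts: outer keys are columns, inner keys become the index
from_nested = pd.DataFrame({'name': {0: 'John', 1: 'Peter'},
                            'age': {0: 10, 1: 20}})

print(from_columns.equals(from_rows))    # True
print(from_columns.equals(from_nested))  # True
```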

See the DataFrame documentation for more examples.

mozway
  • sure, and you can also pass Dataclasses, Series, etc. to it. My question is more about why the 'default' it uses (and why most examples use) the column orientation? – David542 Aug 18 '23 at 20:58
  • 2
    @David542 it's true that it's a common way, I don't think I would call it a default. You can also pass a full 2D array. What often makes sense it to pass columns when you have different types. – mozway Aug 18 '23 at 21:05

You can use row notation to construct a DataFrame like this:

import pandas as pd

d = [['John', 10], ['Peter', 20]]
df = pd.DataFrame(d, columns=['Name', 'Age'])

df

Output:

    Name  Age
0   John   10
1  Peter   20
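The same row-wise layout also works with a full 2D NumPy array, as mozway mentions in the comments above. A small sketch (with mixed types, the array has to be `object` dtype, otherwise NumPy would coerce everything to strings):

```python
import numpy as np
import pandas as pd

# each inner list is one row; dtype=object preserves the mixed str/int values
arr = np.array([['John', 10], ['Peter', 20]], dtype=object)
df = pd.DataFrame(arr, columns=['Name', 'Age'])
print(df)
```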
Scott Boston

To answer your question: in the context of pandas, the primary reason you often see examples with column-oriented data is that DataFrames are designed to represent tabular data in which each column corresponds to a specific attribute or variable. This aligns with how data is typically organized in databases and spreadsheets, and it makes column-oriented examples more intuitive for beginners. Besides, in many real-world scenarios it is easy to collect and manage data column by column, while each observation (row) may carry multiple attributes, and adding a new observation is as simple as appending a row.

But you can absolutely put data into an existing DataFrame row-wise.

Solution 1: DataFrame.loc[] indexer

Here's an example I use often with .loc indexer to assign the data to specific rows in the DataFrame:

import pandas as pd

# Create an empty DataFrame with column names
columns = ['Name', 'Age', 'Country']
data_frame = pd.DataFrame(columns=columns)

# Data for each row
row1_data = ['Alice', 25, 'USA']
row2_data = ['Bob', 32, 'Canada']
row3_data = ['Eve', 28, 'UK']

# Inserting data into the DataFrame row-wise
data_frame.loc[0] = row1_data
data_frame.loc[1] = row2_data
data_frame.loc[2] = row3_data

# Append new data at the end
new_row = ['Rafi', 28, 'Dhaka']
data_frame.loc[len(data_frame)] = new_row

# Display the DataFrame
print(data_frame)

Output:

    Name Age Country
0  Alice  25     USA
1    Bob  32  Canada
2    Eve  28      UK
3   Rafi  28   Dhaka

You can iterate through a list of rows and append each one as the last entry of the DataFrame this way.
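A minimal sketch of that loop (a caveat: growing a DataFrame one `.loc` assignment at a time is convenient but slow for large inputs, since pandas may copy data on each step; collecting the rows in a list and constructing the frame once is generally faster):

```python
import pandas as pd

rows = [['Alice', 25, 'USA'], ['Bob', 32, 'Canada'], ['Eve', 28, 'UK']]

df = pd.DataFrame(columns=['Name', 'Age', 'Country'])
for row in rows:
    df.loc[len(df)] = row  # len(df) is always the next integer label

print(df)
```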

Solution 2: Using DataFrame.from_records()

import pandas as pd

# Data for multiple rows as tuples
data = [('Alice', 25, 'USA'),
        ('Bob', 32, 'Canada'),
        ('Eve', 28, 'UK')]

# Column names
columns = ['Name', 'Age', 'Country']

# Create DataFrame from records
data_frame = pd.DataFrame.from_records(data, columns=columns)

# Display the DataFrame
print(data_frame)

Output:

    Name Age Country
0  Alice  25     USA
1    Bob  32  Canada
2    Eve  28      UK

Solution 3: Using DataFrame.from_dict()

You can add multiple rows, or a single row, as dictionaries. This is especially useful if you have a .json file to pass as data:

import pandas as pd

# Data as a list of dictionaries
data = [
    {'Name': 'Alice', 'Age': 25, 'Country': 'USA'},
    {'Name': 'Bob', 'Age': 32, 'Country': 'Canada'},
    {'Name': 'Eve', 'Age': 28, 'Country': 'UK'}
]

# Create DataFrame from dictionary
data_frame = pd.DataFrame.from_dict(data)

# Display the DataFrame
print(data_frame)

Output:

    Name Age Country
0  Alice  25     USA
1    Bob  32  Canada
2    Eve  28      UK
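`from_dict` also accepts `orient='index'`, which reads the *outer* keys as row labels instead of columns. This is handy when a JSON object maps record IDs to records (a sketch; the `'id1'`/`'id2'` keys are made-up identifiers):

```python
import pandas as pd

# outer keys become the index, inner dicts become the rows
records = {
    'id1': {'Name': 'Alice', 'Age': 25, 'Country': 'USA'},
    'id2': {'Name': 'Bob', 'Age': 32, 'Country': 'Canada'},
}
data_frame = pd.DataFrame.from_dict(records, orient='index')
print(data_frame)
```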

Hope it helps!

Musabbir Arrafi
  • Can you please clarify: "This aligns with how data is typically organized in databases, spreadsheets, and CSV files." How would a csv file, for example, be organized "column-wise" ? In fact, it's row-wise. – David542 Aug 22 '23 at 17:10
  • sorry I might be a little misleading there, CSV file is not a good example. But spreadsheets and databases are. I meant in the context of data organizing, normally each column represents a feature/attribute of the data. I will edit that part – Musabbir Arrafi Aug 22 '23 at 17:33

It makes vectorized operations easier. Take the example you have:

data = {
    'name': ['John', 'Peter'],
    'age': [10, 20]
}

df = pd.DataFrame(data)

If you want to add 5 to each age:

df['age'] = df['age'] + 5

If you kept the data in row-based storage:

data = [
    {'name': 'John', 'age': 10}, 
    {'name': 'Peter', 'age': 20}
]

then adding 5 to each age requires iteration:

for person in data:
    person['age'] += 5

For more background, see: Why is vectorization faster, in general, than loops?
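That said (as noted in the comments on the question), the orientation of the original Python objects stops mattering once the data is loaded: a frame built from a list of dicts supports the exact same vectorized update, because pandas stores the values column-wise internally either way:

```python
import pandas as pd

data = [{'name': 'John', 'age': 10},
        {'name': 'Peter', 'age': 20}]
df = pd.DataFrame(data)    # rows in, columnar storage out
df['age'] = df['age'] + 5  # vectorized, no Python-level loop
print(df['age'].tolist())  # [15, 25]
```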

Another reason, from Oracle's brief on hybrid columnar compression (https://www.oracle.com/a/ocom/docs/database/hybrid-columnar-compression-brief.pdf):

Storing column data together, with the same data type and similar characteristics, dramatically increases the storage savings achieved from compression. However, storing data in this manner can negatively influence database performance when application queries access more than one or two columns, perform even a modest number of updates or insert small numbers of rows per transaction.

Compression algorithms often work better when applied to similar data values, leading to reduced memory usage.
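The pandas-level analogue of this is that each column is stored as one homogeneous NumPy block, which is what makes both compact storage and vectorization effective. A quick look:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Peter'], 'age': [10, 20]})
print(df.dtypes)                   # 'age' is a single int64 column
print(df['age'].to_numpy().dtype)  # int64 -- one contiguous typed array
```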

Yilmaz

I personally believe in the Zen of Python, which says:

1 Beautiful is better than ugly.
2 Explicit is better than implicit.
3 Simple is better than complex.
4 Complex is better than complicated.
5 Flat is better than nested.
6 Sparse is better than dense.
7 Readability counts.
8 Special cases aren't special enough to break the rules.
9 Although practicality beats purity.
10 Errors should never pass silently.
11 Unless explicitly silenced.
12 In the face of ambiguity, refuse the temptation to guess.
13 There should be one-- and preferably only one --obvious way to do it.
14 Although that way may not be obvious at first unless you're Dutch.
15 Now is better than never.
16 Although never is often better than *right* now.
17 If the implementation is hard to explain, it's a bad idea.
18 If the implementation is easy to explain, it may be a good idea.
19 Namespaces are one honking great idea -- let's do more of those!

The way pandas implements DataFrame creation violates [1, 2, 3, 5, 6, 7, 9, 13, 14, 17], i.e. some fundamental laws of nature: intuition! So here is a helper function to ease the plight of our fellow developers.

import pandas as pd

def DataFrame(d):
    """
    If a dict of columns is given, use the normal method;
    if a list of dicts is given, convert each row to a tuple first.
    """
    if isinstance(d, dict):
        return pd.DataFrame(data=d)

    if isinstance(d, list):
        # keep the ordering of keys from the first record
        schema = list(d[0].keys())
        # build tuples (not generator expressions) so pandas can read each row
        data = [tuple(row[key] for key in schema) for row in d]
        return pd.DataFrame(data=data, columns=schema)
            

d1 = {'name': ['John', 'Peter'], 'age': [10, 20]}
d2 = [
    {'name': 'John', 'age': 10}, 
    {'name': 'Peter', 'age': 20}
]
print(DataFrame(d1))
print(DataFrame(d2))

Output

    name  age
0   John   10
1  Peter   20
    name  age
0   John   10
1  Peter   20
nikhil swami