
I have some specific questions about whether to use Pandas or alternative tools.

  • What is the reason to use Pandas rather than other tools or data structures?

  • When memory is a concern, how heavy is the cost of Pandas and what are the cheaper alternatives?


This is more of a qualitative question. What is the purpose of pandas? I find dictionaries and lists to fit my needs entirely. What's the big fuss with pandas?

For example I can store this table in a nested dictionary using much less memory, if there are lots of rows with identical values:

#key0    key1    value
A        1       a
A        1       b
A        2       a
A        2       b
B        1       a
B        1       b
B        2       a
B        2       b

d = {'A': {1: ['a', 'b'], 2: ['a', 'b']}, 'B': {1: ['a', 'b'], 2: ['a', 'b']}}

Why would I want to use pandas, when there is a much more memory efficient way of holding my nested data? I just don't get it. Thanks!

I'm aware of the abilities of pandas to allow indexing by name, handle missing data, doing join, group by a value and so forth.

This is more of a qualitative question. Perhaps it belongs on Meta Stack Exchange instead.

Raymond Hettinger
tommy.carstensen
  • Because usually it's better to optimize for programmer efficiency than memory efficiency. Pandas has lots of great tooling and a great library. – marisbest2 Mar 22 '17 at 15:16
  • Thanks! I might delete the question. It got downvoted and it was suggested to be closed, because it's too broad. I guess that's true. – tommy.carstensen Mar 22 '17 at 15:17
  • Is there something you didn't understand from the intro in the [docs](http://pandas.pydata.org/pandas-docs/stable/index.html)? – EdChum Mar 22 '17 at 15:20
  • Reworded the question introduction to limit it to a specific question about what capabilities Pandas offers over dicts and lists and about the relative memory costs of Pandas, vs lists/dicts, vs array.array, vs numpy.array. – Raymond Hettinger Mar 22 '17 at 22:37

1 Answer


1) What is the purpose of pandas? What's the big fuss with pandas?

Pandas is primarily known for its ability to load information into dataframes, which allow code to reason about whole columns of data at a time.

Here's the description from the Pandas docs:

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
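To make the column-at-a-time idea concrete, here is a minimal sketch that loads the eight-row table from the question into a dataframe. The variable names (`df`, `upper`, `rows_for_a`, `grouped`, `nested`) are illustrative choices, not anything from the question, and the example assumes pandas is installed.

```python
import pandas as pd

# The eight-row table from the question, one list per column.
df = pd.DataFrame({
    "key0": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "key1": [1, 1, 2, 2, 1, 1, 2, 2],
    "value": ["a", "b", "a", "b", "a", "b", "a", "b"],
})

# Column-at-a-time operations: no explicit Python loop over rows.
upper = df["value"].str.upper()       # vectorized string method on one column
rows_for_a = df[df["key0"] == "A"]    # boolean mask built from one column

# Group by both keys and collect the values per group.
grouped = df.groupby(["key0", "key1"])["value"].apply(list)

# Rebuild the nested-dict shape from the question out of the grouped result.
nested = {}
for (k0, k1), values in grouped.items():
    nested.setdefault(k0, {})[k1] = values

print(grouped)
print(nested)
```

The nested dictionary is still recoverable from the dataframe, but the filtering and grouping steps are one-liners here, whereas with plain dicts and lists each would be a hand-written loop.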

2) Why would I want to use pandas, when there is a much more memory efficient way of holding my nested data?

According to this SO answer, the memory overhead for Pandas isn't that bad.

That said, if memory is a key constraint, you can do better than even Python dicts and lists, both of which keep references to boxed data (values stored in objects). Instead, you can use denser data structures that hold unboxed data. One choice would be Python's array module; another is NumPy arrays.
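As a rough illustration of boxed versus unboxed storage, the sketch below compares a list of floats with an `array('d')` holding the same values, using only the standard library. The size `n` and the variable names are arbitrary assumptions, and `sys.getsizeof` figures are CPython-specific, so treat the printed numbers as indicative rather than exact.

```python
import sys
from array import array

n = 100_000  # arbitrary size chosen for illustration
values = [float(i) for i in range(n)]

# A list stores pointers to separately allocated (boxed) float objects,
# so its footprint is the pointer table plus every float object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array('d') stores raw C doubles contiguously (unboxed).
packed = array("d", values)
array_bytes = sys.getsizeof(packed)

print(f"list of floats: {list_bytes:,} bytes")
print(f"array('d'):     {array_bytes:,} bytes")
```

A NumPy array of the same values is similarly dense; its buffer size is available as the `nbytes` attribute.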

Raymond Hettinger