141

When using R it's handy to load "practice" datasets using

data(iris)

or

data(mtcars)

Is there something similar for Pandas? I know I can load data using any other method; I'm just curious whether there's anything built in.

smci
canyon289
  • 3
    Possible duplicate of [Are there any example data sets for Python?](http://stackoverflow.com/questions/16579407/are-there-any-example-data-sets-for-python) – a different ben May 11 '17 at 07:48

5 Answers

175

Since I originally wrote this answer, I have updated it with the many ways that are now available for accessing sample data sets in Python. Personally, I tend to stick with whatever package I am already using (usually seaborn or pandas). If you need offline access, installing the data set with Quilt seems to be the only option.

Seaborn

The brilliant plotting package seaborn has several built-in sample data sets.

import seaborn as sns

iris = sns.load_dataset('iris')
iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Pandas

If you do not want to import seaborn, but still want to access its sample data sets, you can use @andrewwowens's approach for the seaborn sample data:

import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

Note that sns.load_dataset() modifies the column type of sample data sets that contain categorical columns, so the result might not be the same as what you get by reading the URL directly. The iris and tips sample data sets are also available in the pandas github repo here.
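To illustrate the dtype difference, here is a minimal sketch that casts text columns to categoricals by hand after reading a CSV. It inlines a few rows shaped like the tips data so that it runs offline; the rows and the list of columns to cast are assumptions for illustration, not a reproduction of what seaborn does internally.

```python
import io

import pandas as pd

# A few rows shaped like the seaborn "tips" data, inlined so the sketch
# runs without network access (the values are illustrative).
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sat,Dinner,3
"""

tips = pd.read_csv(io.StringIO(csv_text))

# read_csv does not produce categorical columns on its own.
was_categorical = isinstance(tips["day"].dtype, pd.CategoricalDtype)

# Cast the text columns to categoricals by hand.
for column in ["sex", "smoker", "day", "time"]:
    tips[column] = tips[column].astype("category")

print(was_categorical)
print(tips.dtypes)
```

The same loop can be applied to a frame read from the raw GitHub URL if you want its dtypes to resemble what sns.load_dataset() returns.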

R sample datasets

Since any dataset can be read via pd.read_csv(), it is possible to access all R's sample data sets by copying the URLs from this R data set repository.
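As a rough sketch of the URL approach, the helper below builds a CSV URL and passes it to pd.read_csv(). The base URL and the package/name layout are assumptions modelled on the Rdatasets project, and both function names are hypothetical; check them against the repository you actually copy URLs from.

```python
import pandas as pd

# Assumed URL layout of an Rdatasets-style mirror; adjust it if the
# repository you copy URLs from is organised differently.
BASE_URL = "https://vincentarelbundock.github.io/Rdatasets/csv"


def rdataset_url(name, package="datasets"):
    """Build the CSV URL for an R data set (hypothetical helper)."""
    return f"{BASE_URL}/{package}/{name}.csv"


def load_rdataset(name, package="datasets"):
    """Download an R data set as a DataFrame (needs network access)."""
    # The first CSV column is assumed to hold R's row names.
    return pd.read_csv(rdataset_url(name, package), index_col=0)


# Usage (requires an internet connection):
# iris = load_rdataset("iris")
# mtcars = load_rdataset("mtcars")
```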

Additional ways of loading the R sample data sets include statsmodels

import statsmodels.api as sm

iris = sm.datasets.get_rdataset('iris').data

and PyDataset

from pydataset import data

iris = data('iris')

scikit-learn

scikit-learn returns sample data as numpy arrays rather than a pandas data frame.

from sklearn.datasets import load_iris

iris = load_iris()
# `iris.data` holds the numerical values
# `iris.feature_names` holds the numerical column names
# `iris.target` holds the categorical (species) values (as ints)
# `iris.target_names` holds the unique categorical names
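If you want a data frame anyway, the arrays can be combined by hand. The sketch below is one way to do it; bunch_to_frame is a hypothetical helper name, and the stand-in object only mimics the four attributes listed above so that the example runs without scikit-learn installed. Recent scikit-learn releases also accept load_iris(as_frame=True), which may make the helper unnecessary.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def bunch_to_frame(bunch):
    """Combine the arrays of a scikit-learn style Bunch into one DataFrame."""
    frame = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    # Map the integer targets back to their names as a categorical column.
    frame["target"] = pd.Categorical.from_codes(
        bunch.target, categories=list(bunch.target_names)
    )
    return frame


# Stand-in with the same attributes as the object returned by load_iris().
demo = SimpleNamespace(
    data=np.array([[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4]]),
    feature_names=["sepal length (cm)", "sepal width (cm)",
                   "petal length (cm)", "petal width (cm)"],
    target=np.array([0, 1]),
    target_names=np.array(["setosa", "versicolor"]),
)
demo_frame = bunch_to_frame(demo)
print(demo_frame)

# With scikit-learn installed:
# from sklearn.datasets import load_iris
# iris = bunch_to_frame(load_iris())
```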

Quilt

Quilt is a dataset manager. It includes many common sample datasets, such as several from the uciml sample repository. The quick start page shows how to install and import the iris data set:

# In your terminal
$ pip install quilt
$ quilt install uciml/iris

After installing a dataset, it is accessible locally, so this is the best option if you want to work with the data offline.

import quilt.data.uciml.iris as ir

iris = ir.tables.iris()
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Quilt also supports dataset versioning and includes a short description of each dataset.

joelostblom
38

The built-in pandas testing DataFrames are very convenient.

makeMixedDataFrame():

In [22]: import pandas as pd

In [23]: pd.util.testing.makeMixedDataFrame()
Out[23]:
     A    B     C          D
0  0.0  0.0  foo1 2009-01-01
1  1.0  1.0  foo2 2009-01-02
2  2.0  0.0  foo3 2009-01-05
3  3.0  1.0  foo4 2009-01-06
4  4.0  0.0  foo5 2009-01-07
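In case the testing helper is not available in your pandas version, a frame with the same values as the output above can be built with the public constructor. This is only a sketch that reproduces the printed output; it is not the helper's actual implementation.

```python
import pandas as pd

mixed = pd.DataFrame(
    {
        "A": [0.0, 1.0, 2.0, 3.0, 4.0],
        "B": [0.0, 1.0, 0.0, 1.0, 0.0],
        "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
        # Business days only, so the weekend of 3-4 January 2009 is skipped.
        "D": pd.bdate_range("2009-01-01", periods=5),
    }
)
print(mixed)
```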

Other testing DataFrame options:

makeDataFrame():

In [24]: pd.util.testing.makeDataFrame().head()
Out[24]:
                   A         B         C         D
acKoIvMLwE  0.121895 -0.781388  0.416125 -0.105779
jc6UQeOO1K -0.542400  2.210908 -0.536521 -1.316355
GlzjJESv7a  0.921131 -0.927859  0.995377  0.005149
CMhwowHXdW  1.724349  0.604531 -1.453514 -0.289416
ATr2ww0ctj  0.156038  0.597015  0.977537 -1.498532

makeMissingDataframe():

In [27]: pd.util.testing.makeMissingDataframe().head()
Out[27]:
                   A         B         C         D
qyXLpmp1Zg -1.034246  1.050093       NaN       NaN
v7eFDnbQko  0.581576  1.334046 -0.576104 -0.579940
fGiibeTEjx -1.166468 -1.146750 -0.711950 -0.205822
Q8ETSRa6uY  0.461845 -2.112087  0.167380 -0.466719
7XBSChaOyL -1.159962 -1.079996  1.585406 -1.411159

makeTimeDataFrame():

In [28]: pd.util.testing.makeTimeDataFrame().head()
Out[28]:
                   A         B         C         D
2000-01-03 -0.641226  0.912964  0.308781  0.551329
2000-01-04  0.364452 -0.722959  0.322865  0.426233
2000-01-05  1.042171  0.005285  0.156562  0.978620
2000-01-06  0.749606 -0.128987 -0.312927  0.481170
2000-01-07  0.945844 -0.854273  0.935350  1.165401
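Frames of a similar shape can also be generated with NumPy and the public pandas API. The following is a sketch under assumptions: the 30-row length, the seed, and the roughly 10% share of missing values are illustrative choices, not settings taken from the testing helpers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so the frames are reproducible
columns = list("ABCD")

# Random values on a business-day index, similar in shape to makeTimeDataFrame().
time_df = pd.DataFrame(
    rng.standard_normal((30, 4)),
    index=pd.bdate_range("2000-01-03", periods=30),
    columns=columns,
)

# Blank out roughly 10% of the cells, similar in spirit to makeMissingDataframe().
missing_df = time_df.mask(rng.random(time_df.shape) < 0.1)

print(time_df.head())
print(missing_df.isna().sum())
```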
cheng10
  • 9
    Your answer is underrated- although perhaps it's not exactly what the question was asking for, what I really want is an interactive way to get a dataframe to play with. Thanks! – tomaszps Jan 27 '21 at 19:28
  • 3
    @cheng10 this answer is deprecated. – G. Macia Aug 02 '22 at 19:33
  • @G.Macia: Can you explain? these would likely have been replaced if this calling method is deprecated? – CPBL Nov 13 '22 at 22:05
  • What's to explain? As @G.Macia states, Pandas deprecated the entirety of the `pandas.util.testing` submodule – including *all* of the functionality called above. Ergo, this answer no longer applies. There are no well-maintained alternatives to this functionality (that I know of), but Pandas developers don't appear to especially care. Concerned data scientists (*this means you!*) should begin banging on the Pandas issue tracker about this. – Cecil Curry Mar 31 '23 at 04:10
  • 1
    The package was moved to `pandas._testing`. See [documentation](https://pandas.pydata.org/docs/user_guide/reshaping.html?highlight=maketimedataframe) for an example. – Night Train Jul 27 '23 at 16:43
16

Any publicly available .csv file can be loaded into pandas extremely quickly using its URL. Here is an example using the iris dataset originally from the UCI archive.

import pandas as pd

file_name = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df = pd.read_csv(file_name)
df.head()

The output is the first five rows of the .csv file you just loaded from the given URL.

>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
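If you load the same URL repeatedly, or want to keep working offline afterwards, one option is to keep a local copy. The sketch below is a minimal version of that idea; read_csv_cached and the cache path are hypothetical names, and no cache invalidation is attempted.

```python
from pathlib import Path

import pandas as pd


def read_csv_cached(url, cache_path):
    """Read a CSV from `url`, keeping a local copy at `cache_path`."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Reuse the local copy; no network access is needed.
        return pd.read_csv(cache_path)
    df = pd.read_csv(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)
    return df


# Usage (the first call requires an internet connection):
# url = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
# df = read_csv_cached(url, "data_cache/iris.csv")
```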

A memorable short URL for the same is https://j​.mp/iriscsv. This short URL will work only if it's typed and not if it's copy-pasted.

Asclepius
unique_beast
  • The website is not down. Check https://archive.ics.uci.edu/ml/datasets/Iris for description, or download `iris.names` – zhazha Nov 21 '18 at 10:32
15

The rpy2 module is made for this:

from rpy2.robjects import r, pandas2ri
pandas2ri.activate()

r['iris'].head()

yields

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
1           5.1          3.5           1.4          0.2  setosa
2           4.9          3.0           1.4          0.2  setosa
3           4.7          3.2           1.3          0.2  setosa
4           4.6          3.1           1.5          0.2  setosa
5           5.0          3.6           1.4          0.2  setosa

Up to pandas 0.19 you could use pandas' own rpy interface:

import pandas.rpy.common as rcom
iris = rcom.load_data('iris')
print(iris.head())

yields

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
1           5.1          3.5           1.4          0.2  setosa
2           4.9          3.0           1.4          0.2  setosa
3           4.7          3.2           1.3          0.2  setosa
4           4.6          3.1           1.5          0.2  setosa
5           5.0          3.6           1.4          0.2  setosa

rpy2 also provides a way to convert R objects into Python objects (the ri2py function used here comes from older rpy2 releases; in rpy2 3.x it was renamed rpy2py):

import pandas as pd
import rpy2.robjects as ro
import rpy2.robjects.conversion as conversion
from rpy2.robjects import pandas2ri
pandas2ri.activate()

R = ro.r

df = conversion.ri2py(R['mtcars'])
print(df.head())

yields

    mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
0  21.0    6   160  110  3.90  2.620  16.46   0   1     4     4
1  21.0    6   160  110  3.90  2.875  17.02   0   1     4     4
2  22.8    4   108   93  3.85  2.320  18.61   1   1     4     1
3  21.4    6   258  110  3.08  3.215  19.44   1   0     3     1
4  18.7    8   360  175  3.15  3.440  17.02   0   0     3     2
Patrick FitzGerald
unutbu
  • 1
    Thanks for the suggestion. I was doing this but it violates the "ease" with which the data is available in R. It is a solution that gets it done though! – canyon289 Feb 09 '15 at 19:39
  • 3
    Hm? what is so hard about `rcom.load_data('iris')`? – unutbu Feb 09 '15 at 19:59
  • Likely nothing, I realize I may be being too picky. I appreciate the answer! – canyon289 Feb 09 '15 at 21:43
  • 1
    Note that `pandas.rpy` was [removed in 0.20](http://pandas.pydata.org/pandas-docs/version/0.20/whatsnew.html#removal-of-prior-version-deprecations-changes). To interface with R, `rpy2` is the recommended option. – joelostblom May 10 '17 at 15:53
4

I made some public datasets available in this github repo. You can load them via pd.read_csv(url_to_file.csv)

iris

import pandas as pd

iris = pd.read_csv("https://raw.githubusercontent.com/practiceprobs/datasets/main/iris/iris.csv")
iris.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

mnist

mnist = pd.read_csv("https://raw.githubusercontent.com/practiceprobs/datasets/main/MNIST/mnist.csv")
mnist.head()
   label  1x1  1x2  1x3  1x4  1x5  ...  28x23  28x24  28x25  28x26  28x27  28x28
0      7    0    0    0    0    0  ...      0      0      0      0      0      0
1      2    0    0    0    0    0  ...      0      0      0      0      0      0
2      1    0    0    0    0    0  ...      0      0      0      0      0      0
3      0    0    0    0    0    0  ...      0      0      0      0      0      0
4      4    0    0    0    0    0  ...      0      0      0      0      0      0
[5 rows x 785 columns]

netflix titles

netflix = pd.read_csv("https://raw.githubusercontent.com/practiceprobs/datasets/main/netflix-titles/netflix-titles.csv")
netflix.head()
  show_id  ...                                        description
0      s1  ...  As her father nears the end of his life, filmm...
1      s2  ...  After crossing paths at a party, a Cape Town t...
2      s3  ...  To protect his family from a powerful drug lor...
3      s4  ...  Feuds, flirtations and toilet talk go down amo...
4      s5  ...  In a city of coaching centers known to train I...
[5 rows x 12 columns]
Ben