
I'm starting to learn Python, NumPy and pandas, and I have a really basic question about sizes.

Please see the following code blocks:

1. Length: 6, dtype: int64

# create a Series from a dict
pd.Series({key: value for key, value in zip('abcdef', range(6))})

vs.

2. Length: 6, dtype: int32

# but why does this generate a smaller integer size???
pd.Series(range(6), index=list('abcdef'))

Question: I think that when you pass a list, numpy array, dictionary, etc. to pd.Series you get int64, but when you pass just range(6) to pd.Series you get int32. Can someone please clarify this for me?

Sorry for the very basic question.

Edit: I'm using pandas version 0.20.1 and NumPy 1.12.1.

Mike Evers

1 Answer


They're semantically different: in the first version you pass a dict of scalar values, so the dtype becomes int64; in the second, you pass a range, which can be trivially converted to a numpy array, and on my machine that array is int32:

In[57]:
np.array(range(6)).dtype

Out[57]: dtype('int32')

So constructing the pandas Series involves dtype matching in the first case and none in the second, because a range is convertible to a numpy array and numpy has determined that int32 is preferred in this case.
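To see what numpy picks on a particular machine, a minimal sketch like the following may help. The variable names are illustrative only, and the printed result depends on your OS and numpy version, so no specific output is claimed here.

```python
import numpy as np

# dtype numpy infers when it converts a range of small Python ints
inferred = np.array(range(6)).dtype

# numpy's platform default integer type, shown for comparison
default_int = np.dtype(np.int_)

# Print both names and the width of the inferred type in bits
print(inferred, default_int, inferred.itemsize * 8)
```

If this prints a 32-bit type, a Series built from the same range on an older pandas would be expected to show int32 as well.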

update

It looks like this is dependent on your numpy version and maybe your pandas version. I'm running Python 3.6, numpy 1.12.1 and pandas 0.20.3 on Windows 7 64-bit, and I get the above result.

@jeremycg is running pandas 0.19.2 and numpy 1.11.2 and observes the same result whilst @coldspeed is running numpy 1.13.1 and observes int64.

The takeaway from this is that the dtype will largely be determined by what numpy does.

I believe that this line is what is called when we pass range in this case.

subarr = np.array(arr, dtype=object, copy=copy)

The returned type is determined by numpy and the OS; in my case, Windows defines a C long as 32 bits. See related: numpy array dtype is coming as int32 by default in a windows 10 64 bit machine
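If you need the same dtype regardless of platform, one option is to pass dtype explicitly to the Series constructor instead of relying on inference. This is a sketch, not something the original question used; the names s_dict and s_range are made up for illustration.

```python
import numpy as np
import pandas as pd

# Build both Series from the question, but request int64 explicitly
# so the result does not depend on numpy's platform default.
s_dict = pd.Series({key: value for key, value in zip('abcdef', range(6))},
                   dtype='int64')
s_range = pd.Series(range(6), index=list('abcdef'), dtype='int64')

print(s_dict.dtype, s_range.dtype)
```

With the dtype given explicitly, both constructions should agree on any OS.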

EdChum
  • I got both the code output int64 dtype? Does it depend on pandas version – Bharath M Shetty Sep 15 '17 at 13:13
  • I'm using `0.20.3` – EdChum Sep 15 '17 at 13:15
  • `np.array(list(range(6))).dtype` return int64 in my machine. – Bharath M Shetty Sep 15 '17 at 13:15
  • @EdChum What numpy version are you running? – cs95 Sep 15 '17 at 13:16
  • I'm using python 3.6 and numpy 1.12.1 all 64-bit – EdChum Sep 15 '17 at 13:16
  • That's probably the reason. My numpy is 1.13.1 and I'm also getting Bharath's results. – cs95 Sep 15 '17 at 13:17
  • pandas 0.19.2 and numpy 1.11.2 gives me int32, like @EdChum and the OP – jeremycg Sep 15 '17 at 13:18
  • @cᴏʟᴅsᴘᴇᴇᴅ surprised this behaviour has changed, but essentially `pandas` will apply `np.array` on the passed in data if it's trivial so the `dtype` is derived from this – EdChum Sep 15 '17 at 13:18
  • @EdChum can you add the version of libraries in your answer. – Bharath M Shetty Sep 15 '17 at 13:19
  • I think this is an OS issue rather than package version. Are you by any chance on Windows? – ayhan Sep 15 '17 at 13:23
  • I'm running windows 7 64-bit – EdChum Sep 15 '17 at 13:23
  • This must be the case: https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine – ayhan Sep 15 '17 at 13:24
  • @ayhan that is probably the reason, should we mark as dupe? – EdChum Sep 15 '17 at 13:25
  • I think this still has to do with how pandas decides which data type to use so I'd say not an exact duplicate. – ayhan Sep 15 '17 at 13:28
  • @ayhan, I believe that pandas will try to call the `np.array` ctor on the passed in data if it's iterable or array-like so the dtype will come from `numpy`. in the first case, the default will be `int64` for scalar types passed in this form – EdChum Sep 15 '17 at 13:29
  • @ayhan I think that this line: https://github.com/pandas-dev/pandas/blob/83436af8ae1ccad49b7ceac7471c060d823d10ab/pandas/core/series.py#L2994 is what is eventually called in this case – EdChum Sep 15 '17 at 13:37
  • Yes it seems a numpy array is still constructed but dtype decision is not on numpy's part. – ayhan Sep 15 '17 at 13:42
  • @ayhan I'd expect that once we have numpy arrays and the dtype is not mixed that the dtype will just be passed straight through, I'm trying to search through the code to find anything explicit but it seems to make a Singleblockmanager with the array and then call `NDFrame.__init__` where it will just make a `copy` – EdChum Sep 15 '17 at 13:50