0

I posted a question earlier ( Pandas-ipython, how to create new data frames with drill down capabilities ) and it was pointed out that it is possibly too broad so I have some more specific questions that may be easier to respond to and help me get a start with graphing data.

I have decided to try creating some visualizations of my data using Pandas (or any package accessible through ipython). The first, obvious, problem I run into is how can I filter on certain conditions. For example I type the command:

df.Duration.hist(bins=10)

but get an error due to unrecognized dtypes (there are some entries that aren't in datetime format). How can I exclude these in the original command?

Also, what if I want to create the same histogram but filtering to keep only records that have id's (in an account id field) starting with the integer (or string?) '2'?

Ultimately, I want to be able to create histograms, line plots, box plots and so on but filtering for certain months, user id's, or just bad 'dtypes'.

Can anyone help me modify the above command to add filters to it. (I'm decent with python-new to data)

tnx

update: a kind user below has been trying to help me with this problem. I have a few developments to add to the question and a more specific problem.

I have columns in my data frame for Start Time and End Time and created a 'Duration' column for time lapsed.

The Start Time/End Time columns have fields that look like:

2014/03/30 15:45

and when I apply the command pd.to_datetime() to these columns I get fields resulting that look like:

2014-03-30 15:45:00

I changed the format to datetime and created a new column which is the 'Duration' or time lapsed in one command:

df['Duration'] = pd.to_datetime(df['End Time'])-pd.to_datetime(df['Start Time'])

The format of the fields in the duration column is:

01:14:00

or hh:mm:ss

to indicate time lapsed or 74 mins in the above example.

the dtype of the duration column fields (hh:mm:ss) is:

dtype('<m8[ns]')  

The question is, how can I convert these fields to just integers?

Community
  • 1
  • 1
pythonista
  • 85
  • 1
  • 2
  • 9
  • Check `df.Duration.dtype`. If it isn't a numeric dtype, you can convert using `df.Duration.astype(int)` or `(float)`, provided that you only have strings containing numbers, and no NaNs. It helps if you post a sample `df.Duration.head(4)`. – dmvianna Aug 04 '14 at 22:30
  • Ok-thanks! I can sort of see what's wrong now but don't know how to get around it. – pythonista Aug 04 '14 at 22:55
  • Copy and paste it in your post. – dmvianna Aug 04 '14 at 22:57
  • oops-I typed df.Duration.dtype and get dtype(' – pythonista Aug 04 '14 at 22:59
  • I read somewhere else that if 'Duration' is in the form dd:mm:ss then Duration.days returns days, Duration.minutes the minutes, Duration.seconds the seconds I could then just define a function which inputs the duration (dur) and outputs Duration.days * 3600 + Duration.minutes + Duration.seconds/60 and then write df.f(dur).hist(bins=10) correct? – pythonista Aug 04 '14 at 23:20
  • No. You don't have durations, you have dates. A duration would be a distance between two dates. numpy has a class for that (`timedelta64`), but to calculate that you need to subtract two dates. – dmvianna Aug 04 '14 at 23:29
  • I do have a duration-I calculated it by subtracting two dates and added it to my data frame with: df['Duration'] = pd.to_datetime(df['End Time'])-pd.to_datetime(df['Start Time']) – pythonista Aug 04 '14 at 23:33

1 Answers1

0

I think you need to convert duration (timedelta64) to int (assuming you have a duration). Then the .hist method will work.

from pandas import Series
from numpy.random import rand
from numpy import timedelta64

In [21]:

a = (rand(3) *10).astype(int)
a
Out[21]:
array([3, 3, 8])
In [22]:

b = [timedelta64(x, 'D') for x in a] # This is a duration
b
Out[22]:
[numpy.timedelta64(3,'D'), numpy.timedelta64(3,'D'), numpy.timedelta64(8,'D')]
In [23]:

c = Series(b) # This is a duration
c
Out[23]:
0   3 days
1   3 days
2   8 days
dtype: timedelta64[ns]
In [27]:

d = c.apply(lambda x: x / timedelta64(1,'D')) # convert duration to int
d
Out[27]:
0    3
1    3
2    8
dtype: float64
In [28]:

d.hist()

I converted the duration to days ('D'), but you can convert it to any legal unit.

dmvianna
  • 15,088
  • 18
  • 77
  • 106
  • ok I follow your process but I have a couple questions: my data is in the form dd:hh:mm whereas yours was just in 'Days'. How can I convert mine to minutes? (once I have minutes then presumably I can just use the same lambda fn to remove the 'm' which will be there from the field) – pythonista Aug 05 '14 at 01:20
  • sorry, my other question is: my dtype shows as: dtype(' – pythonista Aug 05 '14 at 01:21
  • No, as I said, you have a date. Look at my output above, a duration dtype will show as dtype: timedelta64[ns] (if the base is nanosecond) or something similar. You said you subtracted your dates. Did you actually assign that result to a column? Are you referencing to the right column? – dmvianna Aug 05 '14 at 01:25
  • yes, I subtracted two dates and created a new column called duration which has fields that have hh:mm:ss so even though it may be showing a data type as date, it is a duration so I need to figure out how to convert that to an integer, ie. I may have 00:14:00 which is 14 mins or 01:12:00 which is 1 hr and 12 mins... – pythonista Aug 05 '14 at 01:48
  • I did this command: df['Duration'] = pd.to_datetime(df['End Time'])-pd.to_datetime(df['Start Time']) to add the column 'Duration' to my data frame 'df' where the 'End Time' and 'Start Time' columns are of the format: 2014/03/27 23:13 for example-I have checked and the result of this was to give fields in the format of hh:mm:ss of time lapsed – pythonista Aug 05 '14 at 01:57
  • Can you post that in the body of your question, and also a sample of the values within the df? Also, it would be a good idea to do each step at a time (creating a datetime column for each value before subtracting), so you can see when things go wrong. – dmvianna Aug 05 '14 at 03:15
  • Are you doing this interactively? Are you using Python or IPython? – dmvianna Aug 05 '14 at 03:18
  • I am trying to figure out how to get a slice of my data but omitting some of it. I am using ipython and there are, among others, two columns in my data frame called 'Start Time' and 'End Time'. All I did was use the command I wrote above to subtract these two columns and create another which has the correct time lapse-I will edit my question (if I can) to show the result of the to_datetime() transformation. – pythonista Aug 05 '14 at 13:50
  • To convert your timedelta to minutes, you just have to divide it by 1 minute, hence: `df['minutes'] = df.Duration / timedelta(1, 'm')`. Sorry, I thought you had read the last line of my answer. Then you'll be able to do `df.minutes.hist()` without problems. – dmvianna Aug 07 '14 at 04:05
  • Thanks! I ended up using the .str somehow but this is good to know! – pythonista Aug 08 '14 at 13:11
  • 1
    no, .str was somethin else, I did: `df['Durationendminusstart'] = pd.to_timedelta(df.Duration, unit='ns').astype('timedelta64[m]')` – pythonista Aug 08 '14 at 13:17
  • Well I'm happy to know I could contribute something small other than my mess of questions, glad you told me! – pythonista Aug 11 '14 at 02:09