0

I think my problem is simple but I've made a long post in the interest of being thorough.

I need to visualize some data but first I need to perform some calculations that seem too cumbersome in Tableau (am I hated if I say tableau sucks!)

I have a general problem with how to output data with my calculations in a nice format that can be visualized either in Tableau or something else so it needs to hang on to a lot of information.

My data set is a number of fields associated to usage of an application by user id. So there are potentially multiple entries for each user id and each entry (record) has information in columns such as time they began using app, end time, price they paid, whether they were on wifi, and other attributes (dimensions).

I have one year of data and want to do things like calculate average/total of duration/price paid in app over each month and over the full year of each user (remember each user will appear multiple times-each time they sign in).

I know some basics, like appending a column which subtracts start time from end time to get time spent and my python is fully functional but my data capabilities are amateur.

My question is, say I want the following attributes (measures) calculated (all per user id): average price, total price, max/min price, median price, average duration, total duration, max/min duration, median duration, and number of times logged in (so number of instances of id) and all on a per month and per year basis. I know that I could calculate each of these things but what is the best way to store them for use in a visualization?

For context, I may want to visualize the group of users who paid on average more than 8$ and were in the app a total of more than 3 hours (to this point a simple new table can be created with the info) but if I want it in terms of what shows they watched and whether they were on wifi (other attributes in the original data set) and I want to see it broken down monthly, it seems like having my new table of calculations won't cut it.

Would it then be best to create a yearly table and a table for each month for a total of 13 tables each of which contain the user id's over that time period with all the original information and then append a column for each calculation (if the calc is an avg then I enter the same value for each instance of an id)?

I searched and found that maybe the plyr functionality in R would be useful but I am very familiar with python and using ipython. All I need is a nice data set with all this info that can then be exported into a visualization software unless you can also suggest visualization tools in ipython :)

Any help is much appreciated, I'm so hoping it makes sense to do this in python as tableau is just painful for the calculation side of things....please help :)

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
pythonista
  • 85
  • 1
  • 2
  • 9

1 Answers1

0

It sounds like you want to run a database query like this:

SELECT user, show, month, wifi, sum(time_in_pp)
GROUP BY user, show, month, wifi
HAVING sum(time_in_pp) > 3

Put it into a database and run your queries using pandas sql interface or ordinary python queries. Presumably you index your database table on these columns.

hughdbrown
  • 47,733
  • 20
  • 85
  • 108
  • Someone at work suggested a database query using SQL, maybe I am just naive but it seems like that won't get me to a visualization. I would ultimately want to have a dataset in Tableau that can be visualized so I was hoping to create this dataset in ipython using Pandas and then export it as an excel file. – pythonista Aug 03 '14 at 19:17
  • There is nothing I am suggesting that rules out creating an excel or CSV file. Check here: http://pandas.pydata.org/pandas-docs/stable/io.html – hughdbrown Aug 03 '14 at 21:22
  • I'm not sure you're going to get an exceptionally illuminating answer for this question because it is very broad. The best way to store data depends a lot on what sort of visualization you're looking to create. Time series needs to be processed in a different way than a histogram... Python+Pandas+IPython+Matplotlib can be used for doing nearly everything you want to do. Pandas in particular can easily manipulate the data. If you narrow down what you're looking for to a very specific question- how to go from a to b, I'm sure plenty of people will help guide you. – DataSwede Aug 04 '14 at 18:55
  • Thanks for all the direction. I can for example compute the average duration per user over the year but then what-I create a histogram to show the average duration yearly as a distribution over numbers of users but I've lost all other information like whether those users were primarily on wifi and what they did monthly so somehow I need to reference the original data. My over-arching question is how do people go back and forth between their calculations and their visualizations most efficiently...Is it just a matter of doing the calculations individually for each visualization you want? – pythonista Aug 04 '14 at 19:20