
I have the following data. I have over 100k records so it's a big file and I'm only showing a portion of it.

import pandas as pd
df1 = pd.DataFrame(data)
print(df1)

   ADDRESS         |   ID   |    DATE    | VIOLATIONS
0  7738 S WESTERN  | CHI065 | 2014-07-08 |         65
1  1111 N HUMBOLDT | CHI010 | 2014-07-16 |         56
2  5520 S WESTERN  | CHI069 | 2014-07-08 |         10
3  1111 N HUMBOLDT | CHI010 | 2014-07-26 |        101
4  1111 N HUMBOLDT | CHI010 | 2014-07-27 |         92
5  5529 S WESTERN  | CHI068 | 2014-08-03 |         20

Q1. What is the average number of violations issued per camera, per day?
Q2. On which day of the week are the most citations issued?
Q3. Has the number of active cameras increased or decreased over the collection period?

I'm still stuck on the first one. I'm able to get the average of violations by date; the output looks like the following:

df1.groupby('DATE').VIOLATIONS.mean()

DATE
2014-07-01    52.168421
2014-07-02    43.228261
2014-07-03    51.617021
2014-07-04    59.596774
2014-07-05    55.380952
2014-07-06    59.983333
2014-07-07    49.237113

But when I change it by adding ID, it gives me an error:

df1.groupby(['DATE', 'ID']).VIOLATIONS.mean()

Help would be much appreciated! Thanks!

waiwai
  • What error did you get? – Josmoor98 Aug 06 '19 at 08:27
  • KeyError Traceback (most recent call last): `----> 4 df1.groupby(['DATE', 'ID']).VIOLATIONS.mean()` — it doesn't like it when I try to add "ID" – waiwai Aug 06 '19 at 13:56

1 Answer

  1. I'm not sure what error you received, but using your example data, the following should work.
In [1]: import pandas as pd
   ...: df = pd.DataFrame([["7738 S WESTERN", "CHI065", "2014-07-08", 65],
   ...:                    ["1111 N HUMBOLDT", "CHI010", "2014-07-16", 56],
   ...:                    ["5520 S WESTERN", "CHI069", "2014-07-08", 10],
   ...:                    ["1111 N HUMBOLDT", "CHI010", "2014-07-26", 101],
   ...:                    ["1111 N HUMBOLDT", "CHI010", "2014-07-27", 92],
   ...:                    ["5529 S WESTERN", "CHI068", "2014-08-03", 20]],
   ...:                   columns=["ADDRESS", "ID", "DATE", "VIOLATIONS"])

Then the following should yield the answer you're looking for.

In [2]: df.groupby(['DATE', 'ID'])['VIOLATIONS'].mean()

Out[2]:
DATE        ID
2014-07-08  CHI065     65
            CHI069     10
2014-07-16  CHI010     56
2014-07-26  CHI010    101
2014-07-27  CHI010     92
2014-08-03  CHI068     20
Name: VIOLATIONS, dtype: int64
  2. To determine the day of the week with the most violations across all addresses:
df['DATE'] = pd.to_datetime(df['DATE'])
df['DAY_OF_WEEK'] = df['DATE'].dt.weekday_name

# sum the violations per weekday, then take the weekday with the largest total
df.groupby('DAY_OF_WEEK').sum().idxmax().to_string(index=False)

yields

'Sunday'
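For Q3, one way to check whether the number of active cameras changed over time is to count distinct camera IDs per year. This is a minimal sketch with made-up sample rows (not the asker's real data), assuming an "active camera" is simply an ID that appears in that year's records; `Series.nunique` counts each ID at most once per year, so repeated readings from the same camera don't inflate the count:

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the question's data
df = pd.DataFrame([["7738 S WESTERN", "CHI065", "2014-07-08", 65],
                   ["5520 S WESTERN", "CHI069", "2014-07-08", 10],
                   ["1111 N HUMBOLDT", "CHI010", "2015-07-16", 56]],
                  columns=["ADDRESS", "ID", "DATE", "VIOLATIONS"])
df["DATE"] = pd.to_datetime(df["DATE"])

# Count distinct camera IDs seen in each calendar year
cameras_per_year = df.groupby(df["DATE"].dt.year)["ID"].nunique()
print(cameras_per_year)
```

Comparing the yearly counts then shows whether the camera network grew or shrank over the collection period.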
Josmoor98
  • I'm a beginner, so how do I add columns to my data? This is what I have right now: `import pandas as pd` `df1 = pd.DataFrame(data)` -- the data is in a big csv file. Thank you! – waiwai Aug 06 '19 at 13:46
  • Not sure I follow. Which list of data are you referring to? Do you mean DataFrame? – Josmoor98 Aug 06 '19 at 13:49
  • Oh so you don't actually have a DataFrame yet? – Josmoor98 Aug 06 '19 at 13:50
  • Have a look at [this](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) – Josmoor98 Aug 06 '19 at 13:51
  • `import pandas as pd` `df1 = pd.DataFrame(data)` This is what I have. In your example you added column names with `columns = ["ADDRESS", "ID", "DATE", "VIOLATIONS"]` — how do I add that to my data? – waiwai Aug 06 '19 at 13:58
  • If you're reading from a csv file, use the `names` parameter in `pd.read_csv()`. Or if you want to change the columns names after reading the csv, then `df.columns = ["ADDRESS", "ID", "DATE", "VIOLATIONS"]` – Josmoor98 Aug 06 '19 at 14:03
  • Does your csv already have column names? Also, `pd.DataFrame(data)` won't work if `data` is a csv file. You first need to use `pd.read_csv()` using your csv filepath. `pd.DataFrame()` doesn't accept csv files in the `data` parameter. – Josmoor98 Aug 06 '19 at 14:05
  • I added columns afterwards and it worked! Thank you so much!!! Now I need to get the 2nd and 3rd questions. Any help would be much appreciated! – waiwai Aug 06 '19 at 14:12
  • No problem. What does citations refer to in Q2 – Josmoor98 Aug 06 '19 at 14:15
  • citations means violations. Thank you! – waiwai Aug 06 '19 at 14:17
  • 2nd question solved! Now the third question is tricky. How do I show the active cameras (which is ID on my dataframe) increased or decreased over the collection period? – waiwai Aug 06 '19 at 15:24
  • By ***show the active cameras increased or decreased***, do you mean the number of violations caught by a camera increases or decreases? – Josmoor98 Aug 06 '19 at 15:34
  • I guess it depends on how you want to measure this. For example, you could simply return a boolean response to determine if `VIOLATIONS` has increased between 2 `DATES`. Or you could go deeper and regress `VIOLATIONS` on `DATE` perhaps. – Josmoor98 Aug 06 '19 at 15:46
  • Maybe we don't care about the violations. All I need to know is if the number of cameras has increased or decreased over time. So how do I sum all the unique IDs per month/year? – waiwai Aug 06 '19 at 15:55
  • `df.groupby(df['DATE'].dt.month)["ID"].count()` or `df.groupby(df['DATE'].dt.year)["ID"].count()`. If you're using pandas 0.23 or newer, you can use `df.groupby(df['DATE'].dt.month_name)["ID"].count()` to return the month name as opposed to an integer representation. – Josmoor98 Aug 06 '19 at 16:13
  • If you found my answer useful, please consider accepting. Thanks – Josmoor98 Aug 06 '19 at 16:16
  • Yes, that worked perfectly! You're too awesome! The weird thing is I used dt.weekday_name to solve the 2nd question and it worked, but when I used month_name, it did not. Using month only gives me numerical values, wonder if there's another way to print out the month names? Thank you again! – waiwai Aug 06 '19 at 17:07
  • `Series.dt.month_name` only works in pandas 0.23 or greater. Also, if you want only unique `ID`, duplicates should be dropped first. So, `df.drop_duplicates('ID').groupby(df['DATE'].apply(lambda x: x.strftime('%B')))["ID"].count() ` should work – Josmoor98 Aug 06 '19 at 17:40
  • Probably best to use the original ‘.year’ – Josmoor98 Aug 06 '19 at 17:53
  • df1.drop_duplicates('ID') year_df = df1.groupby(data['DATE']).apply(lambda x: x.strftime('%Y'))["ID"].count print (year_df) it gave me an error. Any ideas? – waiwai Aug 06 '19 at 18:04
  • df1['ID'].drop_duplicates() year_df = df1.groupby(data['DATE'].dt.year)['ID'].count() print (year_df) Gave me the same result as if I didn't drop the duplicates. I wonder if it really worked? – waiwai Aug 06 '19 at 18:25
  • So, here's the final question: I need to plot the graph and see if there are any outliers, and I'm going to use a histogram. But first, I need to figure out how many of the 162 cameras are in 2014, 2015...2018 and then plot it. I first store data_2014 = data['DATE'].dt.year == 2014 for year 2014, but how do I find which of the 162 unique IDs occurred in 2014? Thank you as always! – waiwai Aug 06 '19 at 20:15
  • I think it may be more suitable to post this as a separate question – Josmoor98 Aug 06 '19 at 20:35
  • I just did. Here is the link: https://stackoverflow.com/questions/57383980/graph-histogram-using-python-and-matplotlib But for some reason today when I press Ctrl+K to format the data structure it went to the Google search bar instead, so it's not as pretty, but it's the same data as in this post. Thank you! – waiwai Aug 06 '19 at 21:06
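The column-naming fix discussed in the comments above can be sketched as follows. This is a minimal illustration with hypothetical CSV contents (the real file and its `data` variable are not shown in the thread); the `names` parameter of `pd.read_csv` assigns the column names while reading, and `header=None` tells pandas the file itself has no header row:

```python
import io
import pandas as pd

# Hypothetical stand-in for the asker's 100k-row CSV file
csv_text = """7738 S WESTERN,CHI065,2014-07-08,65
1111 N HUMBOLDT,CHI010,2014-07-16,56"""

# names= assigns column labels; header=None because the file has no header row
df1 = pd.read_csv(io.StringIO(csv_text),
                  names=["ADDRESS", "ID", "DATE", "VIOLATIONS"],
                  header=None)
print(df1.columns.tolist())
```

If the CSV already has a header row, omit both parameters and rename afterwards with `df1.columns = [...]` instead, as suggested in the comments.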