I have a DataFrame with many computers from different manufacturers, released in different years, along with their sales numbers...

Now, my goal is to find all new computers released in 2015 that didn't exist in any earlier year. That means I have to check whether the computer names are listed in any year before 2015 and, if they are, remove those names from my 2015 list. Furthermore, there are also computers released in 2016, 2017, ... which shouldn't be forgotten. I want to get the number of these new computers.

Well, I have so many values that I don't know whether any names are duplicated across different years, but this was my first idea:

df_noduplicates=df[df.Year<2016](subset=['Name'], keep='first')
df_Year2013 = df[df.Year==2015]
print(df_Year2015.shape(0))

But I only get the error 'DataFrame' object is not callable when I run it. It must come from the first line, but I don't know what I did wrong.

Another problem is that I am supposed to use 'set' to solve this exercise, but I don't know how it could be used in this context.

Thank you for your help in advance. :)

booklover
  • Hello! Could you please add some sample data? : ) – Minarth Oct 22 '20 at 07:21
  • Just FYI, your error occurs because `df[df.Year<2016](subset=['Name'], keep='first')` tries to call the df (as if it were a function or method) with the arguments `subset=['Name']` and `keep='first'`, but a dataframe is not [callable](https://stackoverflow.com/questions/111234/what-is-a-callable) – rcriii Oct 22 '20 at 12:40
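
To make the comment above concrete, here is a minimal sketch on made-up data (the frame `df` and its values are assumptions, since no sample data was posted). Indexing with `df[df.Year < 2016]` already returns a DataFrame, so parentheses placed directly after it try to call that DataFrame. A method such as `drop_duplicates`, which accepts `subset` and `keep`, is presumably what was meant to go in between.

```python
import pandas as pd

# Hypothetical sample data; the real DataFrame was not posted.
df = pd.DataFrame({
    "Name": ["A", "B", "A", "C", "D"],
    "Year": [2013, 2014, 2015, 2015, 2016],
})

filtered = df[df.Year < 2016]  # this is a DataFrame, not a function

try:
    # Same shape as the failing line in the question: calling a DataFrame.
    filtered(subset=["Name"], keep="first")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Presumably intended: call a method on the filtered frame.
df_noduplicates = filtered.drop_duplicates(subset=["Name"], keep="first")

# .shape is a tuple, so the row count is shape[0] rather than shape(0).
print(df_noduplicates.shape[0])
```

Note that `keep='first'` keeps the first row in the current row order, so filtering `df_noduplicates` on `Year == 2015` afterwards only yields the new names if the rows are sorted by year.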

1 Answer

How about:

# find all computer names present before 2015
s = set(df[df.Year<2015]['Name'])

# keep the rows whose name is not already in s AND whose year is 2015 (be careful about those parentheses)
subset_df = df[(df.Name.isin(s)==False) & (df.Year==2015)]

# get the names directly from the subset:
new_names = subset_df['Name'].tolist()
print(new_names)
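
Since the exercise asks for `set` and for the number of new computers, the same idea could also be written with set difference. This is only a sketch on made-up data: the sample frame and the names `names_before_2015`, `names_in_2015`, `new_in_2015`, `seen` and `new_per_year` are assumptions for illustration. The loop at the end extends it to every year, which might cover the 2016, 2017, ... cases mentioned in the question.

```python
import pandas as pd

# Hypothetical sample data; the real DataFrame was not posted.
df = pd.DataFrame({
    "Name": ["A", "B", "A", "C", "C", "E", "D", "B"],
    "Year": [2013, 2014, 2015, 2015, 2015, 2015, 2016, 2016],
})

# Names seen in any year before 2015, and names listed in 2015.
names_before_2015 = set(df[df.Year < 2015]["Name"])
names_in_2015 = set(df[df.Year == 2015]["Name"])

# Set difference: names in 2015 that never appeared earlier.
new_in_2015 = names_in_2015 - names_before_2015
print(len(new_in_2015))

# Same idea for every year: count names not seen in any earlier year.
seen = set()
new_per_year = {}
for year in sorted(df.Year.unique()):
    names = set(df[df.Year == year]["Name"])
    new_per_year[int(year)] = len(names - seen)
    seen |= names  # remember these names for the following years
print(new_per_year)
```

Because a set holds each name only once, a name that appears on several rows in the same year is counted a single time.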
tgrandje
  • I understand the first line. – booklover Oct 22 '20 at 12:08
  • I'll edit the whole lot to make it easier to understand – tgrandje Oct 22 '20 at 12:12
  • I understand the first line. I am not sure about the second line. If I am right, set removes all duplicates and only keeps the first occurrences, which would not be in 2015. The second line checks whether the name is in my set and, if it is not, adds it to the list. Sorry, I am a bit confused, but I do not think that I get the computers that were only released in 2015. – booklover Oct 22 '20 at 12:14
  • Ah ok, thank you, but as I said, I want all computers that were only released in 2015 and did not occur in any earlier year. – booklover Oct 22 '20 at 12:16
  • I just corrected two typos btw. I had indeed made a mistake about the year... – tgrandje Oct 22 '20 at 12:16
  • So my goal is to test whether there are any duplicates in the years before 2015 and, if there are, I want these computers removed from my list of computers in 2016. Sorry for my bad English. – booklover Oct 22 '20 at 12:17
  • Ok, I'll check that. Remember to update your question to state your goal – tgrandje Oct 22 '20 at 12:20
  • Ah okay, I understand, thank you. But if you do so, you will get the names not just from 2015 but also from the later years 2016, 2017, ... To prevent this, could I add: subset_df = df[df[df.Name==2015].Name.isin(s)==False] – booklover Oct 22 '20 at 12:22
  • Yes, I will, thank you very much. But I think your idea is great! :) – booklover Oct 22 '20 at 12:23
  • I think this is it. I have edited this answer to show you how to combine multiple criteria in one query (using both name and year at the same time) – tgrandje Oct 22 '20 at 12:26
  • Okay, great, thank you :) So my suggestion would be wrong and I have to combine multiple criteria in the query like you did, right? – booklover Oct 22 '20 at 12:29
  • Sorry, I have to bother you with just one more question, only for my understanding... (df.Name.isin(s)==False) With this part you check whether the names are in the set and, if they are, they get the boolean 'False'. So if a value is 'False', does that mean that it is automatically excluded from the new dataframe? – booklover Oct 22 '20 at 12:33
  • In fact, this returns a Boolean Series. You can use print(df[df.Name.isin(s)==False]) to understand what this means. You can find more documentation about Boolean indexing here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing – tgrandje Oct 22 '20 at 12:36
  • (In fact, when you index using multiple Boolean Series, they are combined row-wise into a single Boolean Series, which is then used to index your whole dataframe) – tgrandje Oct 22 '20 at 12:38
  • (Please mark the answer as accepted if this has worked for you ;-) ) – tgrandje Oct 22 '20 at 12:38
  • Yep, alright, thank you very much for your help! :) – booklover Oct 22 '20 at 12:45
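
The Boolean indexing discussed in the last comments can be made visible with a small sketch. The sample data and the names `not_seen_before`, `in_2015` and `mask` are assumptions for illustration. Each condition produces one True/False value per row, `&` combines them row by row, and indexing with the result keeps only the rows marked True.

```python
import pandas as pd

# Hypothetical sample data; the real DataFrame was not posted.
df = pd.DataFrame({
    "Name": ["A", "B", "A", "C", "D"],
    "Year": [2013, 2014, 2015, 2015, 2016],
})

# Names present before 2015, as in the answer.
s = set(df[df.Year < 2015]["Name"])

# Each condition is a Boolean Series with one entry per row.
not_seen_before = df.Name.isin(s) == False  # equivalent to ~df.Name.isin(s)
in_2015 = df.Year == 2015

# "&" combines the two Series row by row.
mask = not_seen_before & in_2015
print(mask.tolist())

# Rows where the mask is False are left out of the result.
subset_df = df[mask]
print(subset_df["Name"].tolist())
```

Rows with a False value are not deleted from `df` itself; they are simply not selected into `subset_df`.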