I am writing a function that returns a dictionary keyed by the year of the documents, where each value is the tuple returned by the do_get_citations_per_year function.

This function processes the df:

def do_process_citation_data(f_path):
    global my_ocan

    my_ocan = pd.read_csv(f_path, names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    my_ocan['timespan'] = my_ocan['timespan'].map(parse_timespan)
    #print(my_ocan.info())
    print(my_ocan['timespan'])
    return my_ocan

Then I have this function. Running it on its own does not raise any error:

def do_get_citations_per_year(data, year):
    result = tuple()
    my_ocan['creation'] = pd.DatetimeIndex(my_ocan['creation']).year

    len_citations = len(my_ocan.loc[my_ocan["creation"] == year, "creation"])
    timespan = round(my_ocan.loc[my_ocan["creation"] == year, "timespan"].mean())
    result = (len_citations, timespan)
    print(result)

    return result

When I run that function inside of another function:

def do_get_citations_all_years(data):
    mydict = {}
    s = set(my_ocan.creation)
    for year in s:
        mydict[year] = do_get_citations_per_year(data, year)

    return mydict

I get the error:

  File "/Users/lisa/Desktop/yopy/execution_example.py", line 28, in <module>
    print(my_ocan.get_citations_all_years())
  File "/Users/lisa/Desktop/yopy/ocan.py", line 35, in get_citations_all_years
    return do_get_citations_all_years(self.data)
  File "/Users/lisa/Desktop/yopy/lisa.py", line 112, in do_get_citations_all_years
    mydict[year] = do_get_citations_per_year(data, year)
  File "/Users/lisa/Desktop/yopy/lisa.py", line 99, in do_get_citations_per_year
    timespan = round(my_ocan.loc[my_ocan["creation"] == year, "timespan"].mean())
ValueError: cannot convert float NaN to integer

What can I do to solve the issue?

Thank you in advance

2 Answers

This error means that my_ocan.loc[my_ocan["creation"] == year, "timespan"].mean() is NaN.
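
To see where the NaN can come from, here is a minimal sketch on made-up data (only the column names are taken from the question; the values are hypothetical). The mean of a selection that matches no rows is NaN, and `round()` on NaN raises this same error. The `float()` call is only there so the sketch does not depend on the NumPy scalar type.

```python
import math
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({"creation": [2017, 2017, 2018], "timespan": [10.0, 20.0, 30.0]})

# A year that matches no rows gives an empty selection, whose mean is NaN.
empty_mean = df.loc[df["creation"] == 1999, "timespan"].mean()

try:
    round(float(empty_mean))
    error_message = None
except ValueError as exc:
    # Rounding NaN to an integer is not possible, hence the ValueError.
    error_message = str(exc)
```

So it may be worth checking whether the filter `my_ocan["creation"] == year` actually matches any rows at the moment the function is called.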

You should fill NaN values with 0 before calculating mean because it will not change the mean. Here is an example:

timespan = my_ocan.loc[my_ocan["creation"] == year, "timespan"].fillna(0).mean()
Ha Bom
  • * is `nan` not `None`. – Guimoute Nov 26 '19 at 10:33
  • @Guimoute Thank you. My mistake :) – Ha Bom Nov 26 '19 at 10:34
  • You can also remove the `NaN` values entirely by calling `dropna()` [apparently](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html). That looks more explicit. :) – Guimoute Nov 26 '19 at 10:40
  • Hey! Thank you for your answer. I ran your solution but the error remains =( `timespan = round(my_ocan.loc[my_ocan["creation"] == year, "timespan"].fillna(0).mean())` still raises `ValueError: cannot convert float NaN to integer` – Lisa Siurina Nov 26 '19 at 11:10
  • I posted it also above in the question where it is more readable =) – Lisa Siurina Nov 26 '19 at 12:25

@Ha Bom, filling with zeros will change the mean. I guess the solution would be to drop the rows with NaN instead:

timespan = my_ocan.loc[my_ocan["creation"] == year, "timespan"].dropna().mean()
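
To make the difference between the two suggestions concrete, here is a small sketch on hypothetical values. It also shows that `Series.mean()` skips NaN by default, so neither `fillna(0)` nor `dropna()` helps when the selection itself is empty.

```python
import math
import pandas as pd

# Hypothetical timespans with one missing value.
s = pd.Series([10.0, float("nan"), 20.0])

mean_default = s.mean()           # NaN is skipped by default (skipna=True)
mean_dropna = s.dropna().mean()   # same result as the default
mean_filled = s.fillna(0).mean()  # the zero is counted as a real value

# An empty selection has a NaN mean whichever method is used.
empty = pd.Series([], dtype="float64")
empty_results = [empty.mean(), empty.dropna().mean(), empty.fillna(0).mean()]
```

With these values, the default and `dropna()` means are both 15.0, while the `fillna(0)` mean drops to 10.0.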

If you do not want to drop any rows, then you may want to fill the NaN values with the mean instead; see this Stack Overflow question for an example.

Edit: @Ha Bom's solution was good, seeing that the point was to replace the missing values by zero.

Luis Blanche
  • Hi, Luis! I got past the error by running your solution, but the result is not what I expected. For example, running `print(my_ocan.get_citations_per_year(2018))` gives `(32, 451.8235294117647)`, but when this function is called inside another one that returns a dictionary, I get `{2016: (0, nan), 2017: (0, nan), 2018: (0, nan), 2013: (0, nan), 2015: (0, nan)}`. Why is that? – Lisa Siurina Nov 26 '19 at 11:47
  • Hi Lisa, what does your function to build the dict look like ? – Luis Blanche Nov 26 '19 at 12:18
  • `def do_get_citations_all_years(data): mydict = {} s = set(my_ocan.creation) for year in s: mydict[year] = do_get_citations_per_year(data, year) return mydict` – Lisa Siurina Nov 26 '19 at 12:24
  • Ah, I got it. I think the reason is that I dropped the rows by using `dropna()`, which I do not want to do. I need those 0.0 values since they matter when calculating an average later. Is there any other way I can fill those empty values? – Lisa Siurina Nov 26 '19 at 13:11
  • Well if your missing values can be replaced by zero then @Ha Bom solution is the right one. But you have to be sure that you have data for every year otherwise `my_ocan.loc[my_ocan["creation"] == year, "timespan"]` might just be empty – Luis Blanche Nov 26 '19 at 13:54
  • I ran the function with `.fillna(0)` and got `{2016: (0, nan), 2017: (0, nan), 2018: (0, nan), 2013: (0, nan), 2015: (0, nan)}` instead of output like `{2016: (32, 240.03125)}`. So even filling with zeros triggers the same issue =( – Lisa Siurina Nov 26 '19 at 22:13
  • `def do_get_citations_all_years(data): mydict = {} s = set(my_ocan.creation) for year in s: mydict[year] = do_get_citations_per_year(data, year) return mydict` This means your function do_get_citations_per_year always returns `(0, nan)`. – Luis Blanche Nov 27 '19 at 08:03
  • Thank you Luis, but why? When I run it separately it returns a tuple like this: `(32, 240.03125)`. I am so sorry for bothering you. – Lisa Siurina Nov 27 '19 at 10:27
  • I think I have it: you are looping on `set(my_ocan.creation)`, which according to your code contains datetimes. Here's a solution: ```def do_get_citations_all_years(data): mydict = {} s = set(my_ocan.creation) for dt in s: mydict[dt.year] = do_get_citations_per_year(data, dt.year) return mydict``` – Luis Blanche Nov 27 '19 at 13:00
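
Building on the last comment above, here is one possible sketch, under the assumption that `creation` holds datetimes as produced by `do_process_citation_data`. It derives the year with `.dt.year` on each call instead of overwriting the `creation` column, so repeated calls see the same data. Reading from the `data` argument rather than the global `my_ocan`, returning `None` for an empty year, and the toy rows are illustrative choices, not the asker's confirmed code.

```python
import pandas as pd

def do_get_citations_per_year(data, year):
    # Derive the year on the fly; data["creation"] is left unmodified.
    selected = data.loc[data["creation"].dt.year == year, "timespan"]
    # An empty (or all-NaN) selection has a NaN mean, so round only when
    # there is at least one real value; None is a hypothetical placeholder.
    timespan = round(float(selected.mean())) if selected.notna().any() else None
    return (len(selected), timespan)

def do_get_citations_all_years(data):
    years = sorted(set(data["creation"].dt.year))
    return {year: do_get_citations_per_year(data, year) for year in years}

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "creation": pd.to_datetime(["2017-01-05", "2017-06-01", "2018-03-09"]),
    "timespan": [10.0, 20.0, 30.0],
})
result = do_get_citations_all_years(toy)
```

Because nothing is written back to the frame, calling either function a second time gives the same answer as the first call.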