I am confused about where the error creeps in: the deployed code below produced both integers and Booleans in the same column (nutrition). This does not happen with the small data sets I use for testing. What could be going on here? In months where no LopNr had a sum greater than 1, did pandas not convert the True to 1? If so, why not? In any case, is it safe to manually override the end result with astype(int), as I do in the groupby line?
The relevant columns of the data look like this:
LopNr  DIAGNOS  INDATUMA
1      E12 E14  20050705
The code is:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
all_treatments = list()
filelist = ['file1']
nutrition_codes = '|'.join(["D{}".format(i) for i in range(50, 54)] + ["E{}".format(i) for i in range(10, 15)] + ["E{}".format(i) for i in range(40, 47)] + ["E{}".format(i) for i in range(50, 69)])
for file in filelist:
    filename = 'PATH/' + file + '.txt'
    if file[0] == 'o':
        treatments = pd.read_table(filename, usecols=[0, 8, 10])
    elif file[0] == 's':
        treatments = pd.read_table(filename, usecols=[0, 8, 11])
    else:
        print("file should start with s or o, no?")
    all_treatments.append(treatments)
all_treatments = pd.concat(all_treatments, ignore_index=True)
all_treatments['date'] = pd.to_datetime(all_treatments['INDATUMA'].astype(str), coerce=True)
all_treatments['year'] = all_treatments['date'].dt.year
all_treatments['month'] = all_treatments['date'].dt.month
all_treatments['nutrition'] = all_treatments.DIAGNOS.str.contains(nutrition_codes)
all_treatments = all_treatments.drop(['DIAGNOS','INDATUMA','date'], axis=1)
all_treatments = all_treatments.groupby(['LopNr','year','month']).sum().astype(int,copy=False,raise_on_error=False)
all_treatments.to_csv('PATH/treatments_monthly.csv')
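
One possible cause, not confirmed against the real data, is that some rows have a missing DIAGNOS. In that case str.contains returns an object column holding True, False and NaN rather than a pure Boolean column, and a failed integer conversion could be hidden by raise_on_error=False. The sketch below reproduces only the first part on hypothetical toy data (the three rows and the names toy, raw, filled and counts are made up for illustration) and shows the na=False option of str.contains as one way to keep the column Boolean before summing.

```python
import numpy as np
import pandas as pd

# Same pattern as in the code above.
nutrition_codes = '|'.join(
    ["D{}".format(i) for i in range(50, 54)]
    + ["E{}".format(i) for i in range(10, 15)]
    + ["E{}".format(i) for i in range(40, 47)]
    + ["E{}".format(i) for i in range(50, 69)]
)

# Hypothetical toy data: the third row has a missing DIAGNOS (an assumption
# about what the full data might contain, not something shown in the sample).
toy = pd.DataFrame({
    "LopNr": [1, 1, 2],
    "DIAGNOS": pd.Series(["E12 E14", "A01", np.nan], dtype=object),
})

# Without na=..., the missing value propagates and the result is an object
# column mixing True, False and NaN.
raw = toy["DIAGNOS"].str.contains(nutrition_codes)
print(raw.dtype)

# With na=False, missing diagnoses count as "no match" and the column is bool.
filled = toy["DIAGNOS"].str.contains(nutrition_codes, na=False)
print(filled.dtype)

# Summing a bool column per group yields integer counts.
counts = toy.assign(nutrition=filled).groupby("LopNr")["nutrition"].sum()
print(counts)
```

Checking all_treatments['nutrition'].dtype right after the str.contains line on the full data would show whether this is what happens in deployment.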