I'm trying to train a PassiveAggressiveClassifier using TfidfVectorizer with the partial_fit technique in the script below:
Code Updated:
import csv
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

a, ta = [], []   # training texts, test texts
r, tr = [], []   # training labels, test labels
g = []           # every label seen, used to build the class list

vect = HashingVectorizer(ngram_range=(1, 4))
model = PassiveAggressiveClassifier()

with open('files') as f:
    for line in f:
        line = line.strip()

        # First pass: collect all labels so partial_fit can be told
        # about the full set of classes up front.
        with open('gau-' + line + '.csv') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                g.append(row['gau'])
        cls = np.unique(g)
        print(len(cls))

        # Second pass: fit the model 400 records at a time.
        with open('gau-' + line + '.csv') as csvfile:
            reader = csv.DictReader(csvfile)
            i = 0
            for row in reader:
                arr = row['text']
                res = row['gau']
                if len(res) > 0:
                    # Keep only labelled rows so texts and labels stay aligned.
                    a.append(arr)
                    r.append(int(res))
                    i += 1
                if i > 0 and i % 400 == 0:
                    # HashingVectorizer is stateless, so transform() is enough.
                    training_set = vect.transform(a)
                    print(training_set.shape)
                    training_result = np.array(r)
                    model.partial_fit(training_set, training_result, classes=cls)
                    a, r, i = [], [], 0

print(model)

# ta and tr are expected to hold the held-out test texts and labels.
testing_set = vect.transform(ta)
testing_result = np.array(tr)
predicted = model.predict(testing_set)
print("Result to be predicted: " + str(testing_result))
print("Prediction: " + str(predicted))
There are multiple CSV files, each containing 4k-5k records, and I am trying to fit 400 records at a time using the partial_fit function. When I ran this code, the prediction did not match the expected result:
Result to be predicted: 1742
Prediction: 2617
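Rather than comparing a single record by eye, the whole held-out batch could be scored like this (a minimal sketch continuing from the script above, assuming ta and tr have already been filled with the test texts and labels):

from sklearn.metrics import accuracy_score

testing_set = vect.transform(ta)      # ta: held-out test texts (assumed populated)
testing_result = np.array(tr)         # tr: their integer labels (assumed populated)
predicted = model.predict(testing_set)
print("Accuracy on held-out batch:", accuracy_score(testing_result, predicted))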
How do I resolve this issue? The records in my CSV files are of variable length.
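For what it's worth, the variable record length shouldn't change the width of the feature matrix once hashing is used; a quick sanity check with made-up strings:

from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(ngram_range=(1, 4))
short_doc = vect.transform(["just a few words"])
long_doc = vect.transform(["a much longer record " * 1000])
# Both shapes are (1, 1048576): the width is fixed by n_features (default 2**20).
print(short_doc.shape, long_doc.shape)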
UPDATE:
Replacing TfidfVectorizer with HashingVectorizer, I successfully created my model, but when running predictions on my test data, the generated predictions were all incorrect.
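If it helps to see why that switch works: HashingVectorizer is stateless, so every batch can be transformed independently into the same feature space and fed to partial_fit. A minimal sketch of that pattern (the texts, labels, and batch contents below are made up):

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Made-up mini-batches standing in for the CSV rows.
batches = [
    (["some text", "more text"], [1742, 2617]),
    (["other text", "yet more text"], [2617, 1742]),
]
classes = np.array([1742, 2617])

vect = HashingVectorizer(ngram_range=(1, 4))
model = PassiveAggressiveClassifier()

for texts, labels in batches:
    # No fitting needed: hashing maps each batch into the same feature space.
    X = vect.transform(texts)
    model.partial_fit(X, np.array(labels), classes=classes)

print(model.predict(vect.transform(["some text"])))

The same classes array is passed on every call; scikit-learn only requires it on the first call and checks that it does not change afterwards.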
My training data consists of millions of lines across CSV files, and each line contains at most 4k-5k words of text. Is there any problem with my approach, i.e., can these algorithms be used with my data?