Background...
I am trying to create a classifier that will try to automatically create ledger-cli entries based on previous ledger-cli entries and the transaction description provided in downloaded banking statements.
My idea is to parse entries from an existing ledger-cli file, extract features and labels from them, and train on those. Then, when I import new transactions, I would extract the same features from them to predict two things: A) the ledger destination account and B) the payee.
I have done a ton of googling, which I think has gotten me pretty far, but I am really green with respect to classification, so I am not certain that I am approaching this the right way or that I understand enough to make decisions that would yield satisfactory results. If my classifier cannot predict both the ledger account and the payee, I would then prompt for the missing values.
I used the answer to this question as a template, replacing its New York and London examples with banking descriptions: "use scikit-learn to classify into multiple categories"
Each ledger entry has exactly one payee and one destination account.
When I tried my solution (similar to what was presented in the link above), I expected to get back one predicted ledger destination account and one predicted payee for each input sample. For some samples I did get this, but for others I got only a ledger destination account or only a payee. Is this expected? When only one value is returned, how do I know whether it is the ledger destination account or the payee?
In addition, I am not sure whether what I am trying to do is considered multi-class, multi-label or multi-output.
Any help would be greatly appreciated.
Here is my current script and output:
#! /usr/bin/env python3
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
X_train = np.array(["POS MERCHANDISE",
                    "POS MERCHANDISE TIM HORTONS #57",
                    "POS MERCHANDISE LCBO/RAO #0266",
                    "POS MERCHANDISE RONA HOME & GAR",
                    "SPORT CHEK #264 NEPEAN ON",
                    "LOBLAWS 1035 NEPEAN ON",
                    "FARM BOY #90 NEPEAN ON",
                    "WAL-MART #3638 NEPEAN ON",
                    "COSTCO GAS W1263 NEPEAN ON",
                    "COSTCO WHOLESALE W1263 NEPEAN ON",
                    "FARM BOY #90",
                    "LOBLAWS 1035",
                    "YIG ROSS 819",
                    "POS MERCHANDISE STARBUCKS #456"
                    ])
y_train_text = [["HOMESENSE","Expenses:Shopping:Misc"],
                ["TIM HORTONS","Expenses:Food:Dinning"],
                ["LCBO","Expenses:Food:Alcohol-tobacco"],
                ["RONA HOME & GARDEN","Expenses:Auto"],
                ["SPORT CHEK","Expenses:Shopping:Clothing"],
                ["LOBLAWS","Expenses:Food:Groceries"],
                ["FARM BOY","Expenses:Food:Groceries"],
                ["WAL-MART","Expenses:Food:Groceries"],
                ["COSTCO GAS","Expenses:Auto:Gas"],
                ["COSTCO","Expenses:Food:Groceries"],
                ["FARM BOY","Expenses:Food:Groceries"],
                ["LOBLAWS","Expenses:Food:Groceries"],
                ["YIG","Expenses:Food:Groceries"],
                ["STARBUCKS","Expenses:Food:Dinning"]]
X_test = np.array(['POS MERCHANDISE STARBUCKS #123',
                   'STARBUCKS #589',
                   'POS COSTCO GAS',
                   'COSTCO WHOLESALE',
                   "TIM HORTON'S #58",
                   'BOSTON PIZZA',
                   'TRANSFER OUT',
                   'TRANSFER IN',
                   'BULK BARN',
                   'JACK ASTORS',
                   'WAL-MART',
                   'WALMART'])
#target_names = ['New York', 'London']
lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)
for item, labels in zip(X_test, all_labels):
    print('%s => %s' % (item, ', '.join(labels)))
Output:
POS MERCHANDISE STARBUCKS #123 => Expenses:Food:Dinning
STARBUCKS #589 => Expenses:Food:Dinning, STARBUCKS
POS COSTCO GAS => COSTCO GAS, Expenses:Auto:Gas
COSTCO WHOLESALE => COSTCO, Expenses:Food:Groceries
TIM HORTON'S #58 => Expenses:Food:Dinning
BOSTON PIZZA => Expenses:Food:Groceries
TRANSFER OUT => Expenses:Food:Groceries
TRANSFER IN => Expenses:Food:Groceries
BULK BARN => Expenses:Food:Groceries
JACK ASTORS => Expenses:Food:Groceries
WAL-MART => Expenses:Food:Groceries, WAL-MART
WALMART => Expenses:Food:Groceries
As you can see, some predictions only provide a ledger destination account, and some, such as BULK BARN, seem to default to 'Expenses:Food:Groceries'.
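As far as I can tell, MultiLabelBinarizer pools payees and accounts into a single label set, so the returned tuple does not say which kind each label is. One workaround I can think of is to look each returned label up in the payees and accounts seen during training. This is only a sketch: `split_labels`, `known_payees` and `known_accounts` are names I made up for illustration, and the training list is shortened.

```python
# Shortened copy of the training labels: payee first, account second.
y_train_text = [["TIM HORTONS", "Expenses:Food:Dinning"],
                ["COSTCO GAS", "Expenses:Auto:Gas"],
                ["WAL-MART", "Expenses:Food:Groceries"]]

# Remember which label strings were payees and which were accounts.
known_payees = {payee for payee, _ in y_train_text}
known_accounts = {account for _, account in y_train_text}


def split_labels(labels):
    """Return (payee, account); either is None if it was not predicted."""
    payee = next((label for label in labels if label in known_payees), None)
    account = next((label for label in labels if label in known_accounts), None)
    return payee, account


# Tuples shaped like the ones returned by lb.inverse_transform(predicted).
print(split_labels(("Expenses:Food:Dinning",)))
print(split_labels(("COSTCO GAS", "Expenses:Auto:Gas")))
```

A `None` in either position would then be the cue to prompt for that value.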
Predicting the payee really depends only on the transaction description and on which payee it has been mapped to in the past; it would not be influenced by which destination ledger account was used. Predicting the ledger destination account might be more involved, as it can depend on the description as well as other possible features such as the amount or the day of week or month of the transaction. For example, a purchase at Costco (which sells mostly bulk food plus large electronics and furniture) of $200 or less would more than likely be Groceries, whereas a purchase of more than $200 might be Household or Electronics. Maybe I should be training two separate classifiers?
Here is an example of a ledger entry that I am parsing to get the data that I will use for features and to identify classes for the ledger destination account and payee.
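To make the two-classifier idea concrete, here is a minimal sketch, assuming each target is treated as its own plain multi-class problem on the description text only. The helper name `make_text_classifier` is made up, the training data is a shortened copy of the lists above, and `TfidfVectorizer` stands in for the `CountVectorizer` plus `TfidfTransformer` pair. I have not checked how well it predicts; it only shows the shape of the approach.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Shortened training data taken from the script above.
descriptions = ["POS MERCHANDISE TIM HORTONS #57",
                "POS MERCHANDISE STARBUCKS #456",
                "COSTCO GAS W1263 NEPEAN ON",
                "COSTCO WHOLESALE W1263 NEPEAN ON",
                "WAL-MART #3638 NEPEAN ON",
                "LOBLAWS 1035 NEPEAN ON"]
payees = ["TIM HORTONS", "STARBUCKS", "COSTCO GAS",
          "COSTCO", "WAL-MART", "LOBLAWS"]
accounts = ["Expenses:Food:Dinning", "Expenses:Food:Dinning",
            "Expenses:Auto:Gas", "Expenses:Food:Groceries",
            "Expenses:Food:Groceries", "Expenses:Food:Groceries"]


def make_text_classifier():
    """Hypothetical helper: TF-IDF features fed into a linear SVM."""
    return make_pipeline(TfidfVectorizer(), LinearSVC())


# One independent multi-class model per target.
payee_clf = make_text_classifier().fit(descriptions, payees)
account_clf = make_text_classifier().fit(descriptions, accounts)

new_descriptions = ["POS MERCHANDISE STARBUCKS #123", "POS COSTCO GAS", "WALMART"]
predicted_payees = payee_clf.predict(new_descriptions)
predicted_accounts = account_clf.predict(new_descriptions)

for desc, payee, account in zip(new_descriptions, predicted_payees, predicted_accounts):
    print('%s => %s | %s' % (desc, payee, account))
```

Because each model is multi-class rather than multi-label, every sample gets exactly one payee and exactly one account, even for a description such as WALMART that shares no tokens with the training data. The account model could later take extra features such as the amount without affecting the payee model.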
2017/01/01 * TIM HORTONS -- payee
    ; Desc: POS MERCHANDISE TIM HORTONS #57 -- transaction description
    Expenses:Food:Dinning -- destination account    $ 5.00
    Assets:Cash
The annotated parts (payee, transaction description and destination account) are the ones I parse. I want to assign a destination account (such as Expenses:Food:Dinning) and a payee (such as TIM HORTONS) by matching the bank transaction description of a new transaction against the descriptions of previous transactions, which are stored in the 'Desc' tag of each ledger entry.
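The parsing step could look roughly like the sketch below. It assumes the entry appears in the file without my `--` annotations, that the `Desc` tag sits on an indented comment line, that the first posting is the destination account, and that the account is separated from the amount by at least two spaces or a tab. The function name `parse_entries` and the regular expressions are my own and are not part of ledger-cli.

```python
import re

# Entry header: date, optional cleared/pending marker, then the payee.
HEADER = re.compile(r"^(\d{4}/\d{2}/\d{2})\s+(?:[*!]\s+)?(.+)$")
# Indented comment line carrying the bank description.
DESC = re.compile(r"^\s+;\s*Desc:\s*(.+)$")
# Indented posting: account name up to two spaces, a tab, or end of line.
POSTING = re.compile(r"^\s+([^;\s].*?)(?:\s{2,}|\t|$)")


def parse_entries(text):
    """Collect payee, Desc tag and first posting account for each entry."""
    entries = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"payee": header.group(2).strip(),
                       "desc": None, "account": None}
            entries.append(current)
            continue
        if current is None:
            continue  # skip anything before the first entry
        desc = DESC.match(line)
        if desc:
            current["desc"] = desc.group(1).strip()
            continue
        posting = POSTING.match(line)
        if posting and current["account"] is None:
            current["account"] = posting.group(1)  # keep the first posting only
    return entries


SAMPLE = """\
2017/01/01 * TIM HORTONS
    ; Desc: POS MERCHANDISE TIM HORTONS #57
    Expenses:Food:Dinning    $ 5.00
    Assets:Cash
"""

entries = parse_entries(SAMPLE)
print(entries)
```

Each resulting dict gives one training sample: `desc` is the feature text, while `payee` and `account` are the two labels.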