Using scikit-learn to classify multiple outputs of banking transactions

Question

Background...

I am trying to create a classifier that will try to automatically create ledger-cli entries based on previous ledger-cli entries and the transaction description provided in downloaded banking statements.

My idea is to parse entries from an existing ledger-cli file, extract features and labels from them, and use those to train a classifier. Then, when I import new transactions, I would use the same features to predict two things: A) the ledger destination account and B) the payee.

I have done a tonne of googling, which I think has gotten me pretty far, but I am really green with respect to classification. I am not certain that I am approaching this the right way, or that I understand everything well enough to make decisions that would yield satisfactory results. If my classifier cannot predict both the ledger account and the payee, I would then prompt for these values as needed.

I used the answer to the following question as a template, replacing the text mentioning New York or London with banking descriptions: use scikit-learn to classify into multiple categories

Each ledger entry has exactly one payee and one destination account.

When I tried my solution (similar to what was presented in the link above), I expected to get back one predicted ledger destination account and one predicted payee for each input sample. For some samples I did get this, but for others I only got a ledger destination account or only a payee. Is this expected? When only one value is returned, how do I know whether it's the ledger destination account or the payee?

In addition, I am not sure whether what I am trying to do is considered multi-class, multi-label, or multi-output.

Any help would be greatly appreciated.

Here is my current script and output:

#! /usr/bin/env python3

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing

X_train = np.array(["POS MERCHANDISE",
"POS MERCHANDISE TIM HORTONS #57",
"POS MERCHANDISE LCBO/RAO #0266",
"POS MERCHANDISE RONA HOME & GAR",
"SPORT CHEK #264 NEPEAN ON",
"LOBLAWS 1035 NEPEAN ON",
"FARM BOY #90 NEPEAN ON",
"WAL-MART #3638 NEPEAN ON",
"COSTCO GAS W1263 NEPEAN ON",
"COSTCO WHOLESALE W1263 NEPEAN ON",
"FARM BOY #90",
"LOBLAWS 1035",
"YIG ROSS 819",
"POS MERCHANDISE STARBUCKS #456"
])
y_train_text = [["HOMESENSE","Expenses:Shopping:Misc"],
["TIM HORTONS","Expenses:Food:Dinning"],
["LCBO","Expenses:Food:Alcohol-tobacco"],
["RONA HOME & GARDEN","Expenses:Auto"],
["SPORT CHEK","Expenses:Shopping:Clothing"],
["LOBLAWS","Expenses:Food:Groceries"],
["FARM BOY","Expenses:Food:Groceries"],
["WAL-MART","Expenses:Food:Groceries"],
["COSTCO GAS","Expenses:Auto:Gas"],
["COSTCO","Expenses:Food:Groceries"],
["FARM BOY","Expenses:Food:Groceries"],
["LOBLAWS","Expenses:Food:Groceries"],
["YIG","Expenses:Food:Groceries"],
["STARBUCKS","Expenses:Food:Dinning"]]

X_test = np.array(['POS MERCHANDISE STARBUCKS #123',
                   'STARBUCKS #589',
                   'POS COSTCO GAS',
                   'COSTCO WHOLESALE',
                   "TIM HORTON'S #58",
                   'BOSTON PIZZA',
                   'TRANSFER OUT',
                   'TRANSFER IN',
                   'BULK BARN',
                   'JACK ASTORS',
                   'WAL-MART',
                   'WALMART'])

# Binarize the labels: every distinct payee and account string becomes its
# own 0/1 column in Y.
lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)

# Bag-of-words counts -> TF-IDF weights -> one binary LinearSVC per label.
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)

# Print the labels predicted for each test description.
for item, labels in zip(X_test, all_labels):
    print('%s => %s' % (item, ', '.join(labels)))

Output:

POS MERCHANDISE STARBUCKS #123 => Expenses:Food:Dinning
STARBUCKS #589 => Expenses:Food:Dinning, STARBUCKS
POS COSTCO GAS => COSTCO GAS, Expenses:Auto:Gas
COSTCO WHOLESALE => COSTCO, Expenses:Food:Groceries
TIM HORTON'S #58 => Expenses:Food:Dinning
BOSTON PIZZA => Expenses:Food:Groceries
TRANSFER OUT => Expenses:Food:Groceries
TRANSFER IN => Expenses:Food:Groceries
BULK BARN => Expenses:Food:Groceries
JACK ASTORS => Expenses:Food:Groceries
WAL-MART => Expenses:Food:Groceries, WAL-MART
WALMART => Expenses:Food:Groceries

As you can see, some predictions only provide a ledger destination account, and some, such as BULK BARN, seem to default to 'Expenses:Food:Groceries'.

Predicting the payee really depends only on the transaction description and on which payee it has been mapped to in the past; it would not be influenced by which destination ledger account was used. Predicting the ledger destination account might be more involved, as it can depend on the description as well as on other possible features such as the amount, or the day of week or month of the transaction. For example, a purchase at Costco (which sells mostly bulk food plus large electronics and furniture) of $200 or less would more than likely be Groceries, whereas a purchase of more than $200 might be Household or Electronics. Maybe I should be training two separate classifiers?
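To make the amount idea above concrete, here is a minimal sketch of an account classifier that sees both the description and an "over $200" flag. The amounts, the account name Expenses:Household, and the helper amount_feature are all made up for illustration; only the $200 threshold comes from the example above.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training data: the amounts and the Household account are
# invented for this sketch, not taken from a real ledger file.
descriptions = [
    "COSTCO WHOLESALE W1263 NEPEAN ON",
    "COSTCO WHOLESALE W1263 NEPEAN ON",
    "COSTCO WHOLESALE W1263 NEPEAN ON",
    "COSTCO WHOLESALE W1263 NEPEAN ON",
    "LOBLAWS 1035 NEPEAN ON",
    "FARM BOY #90 NEPEAN ON",
]
amounts = np.array([85.40, 450.00, 120.10, 899.99, 64.20, 32.75])
accounts = [
    "Expenses:Food:Groceries",
    "Expenses:Household",
    "Expenses:Food:Groceries",
    "Expenses:Household",
    "Expenses:Food:Groceries",
    "Expenses:Food:Groceries",
]


def amount_feature(values, threshold=200.0):
    """Return a one-column sparse matrix: 1.0 if amount > threshold, else 0.0."""
    flags = (np.asarray(values, dtype=float) > threshold).astype(float)
    return csr_matrix(flags.reshape(-1, 1))


# Text features from the description, plus the amount flag as an extra column.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(descriptions)
X = hstack([X_text, amount_feature(amounts)]).tocsr()

account_clf = LinearSVC().fit(X, accounts)

# New transactions must be transformed the same way before predicting.
new_descriptions = ["COSTCO WHOLESALE W1263 NEPEAN ON"] * 2
new_amounts = [45.00, 600.00]
X_new = hstack([vectorizer.transform(new_descriptions),
                amount_feature(new_amounts)]).tocsr()
predicted_accounts = account_clf.predict(X_new)
for amount, account in zip(new_amounts, predicted_accounts):
    print("%.2f => %s" % (amount, account))
```

A single threshold flag is the crudest possible encoding; a scaled amount or several buckets would be equally reasonable choices, and which works best would have to be checked on real data.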

Here is an example of a ledger entry that I am parsing to get the data that I will use for features and to identify classes for the ledger destination account and payee.

2017/01/01 * TIM HORTONS  -- payee
    ; Desc: POS MERCHANDISE TIM HORTONS #57  -- transaction description
    Expenses:Food:Dinning  $ 5.00  -- destination account
    Assets:Cash

The annotated parts (payee, transaction description, destination account) are the parts that I parse. I want to assign a destination account (such as Expenses:Food:Dinning) and a payee (such as TIM HORTONS) by matching the bank transaction description of a new transaction against the descriptions of previous transactions, which are stored in the 'Desc' tag of each ledger entry.

You should probably read and work through the scikit-learn tutorials to get a clearer idea of what you need to use: http://scikit-learn.org/stable/tutorial/basic/tutorial.html — Dadep, Feb 14 '17 at 18:11
I have looked at the tutorials and multiple examples and think I understand most of it, but I still cannot fit a model to what I am trying to accomplish or to how I think the output should look. I am going to post my current working code when I get back home, as maybe that will help clarify what I am trying to do and what I am currently doing. — Jeff M, Feb 14 '17 at 21:55
To understand better: do you already know the number and names of the classes? Do you already know all the variables that can be used for classification? — Dadep, Feb 15 '17 at 10:19
Yes I know the number of classes and names as these are parsed from a ledger input file and the idea is to try to classify to a previously defined class (ledger destination account and payee). I think there is a typo in your second question...yes I know what can be available for use for classification. — Jeff M, Feb 15 '17 at 14:53
So you can use any multi-class supervised machine learning method for classification... you can start with something simple to understand like naive Bayes (https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/ and http://scikit-learn.org/stable/modules/naive_bayes.html). But first of all you have to prepare your data and put it in a format that your classifier can use. I had a look at your code but I don't understand your data very well — Dadep, Feb 15 '17 at 16:12
just an update...so I trained a classifier (naive bayes) that is pretty successful at predicting the 'Payee' based a y features. Now I would like to use that prediction plus additional features to predict 'ledger account'. Not sure how to continue...is this multioutput multilabel, is this chaining, ensemble, ? I have read too much that my head wants to explode. — Jeff M, Feb 23 '17 at 18:44
Classification lets you predict a class given the features. You can use the results of your first classification, add other features, and build a second classifier to predict "ledger account"... but you'll need to create a training set for this second classifier. — Dadep, Feb 23 '17 at 20:54
Would another option be to combine the labels (payee + account) into a composite label that the classifier will then see as one? It will increase the number of possible classes, but the predicted class will retain the correlation between payee and account. For example, for payee=Loblaws (which is a grocery chain in Canada) it would not be appropriate to predict/suggest an account such as Home:Maintenance... — Jeff M, Feb 23 '17 at 23:53
@JeffM Would you be willing to share your code? I would be extremely interested (and could help making it in a polished, distributable command-line tool). — andreas-h, Jun 17 '17 at 16:30
@andreas-h yes for sure! I have forked a python script on git that I have been improving upon to make it more robust. Here is the link to my script https://github.com/mondjef/icsv2ledger the only thing it's missing at this point is the machine learning part which I'm kinda stuck on right now and have been limited on time. Take a look and if need be I can walk you through things give you my thoughts on where I think the machine learning part would fit in. — Jeff M, Jun 20 '17 at 01:07
@andreas-h any chance you had any luck implementing a classifier in the icsv2ledger script? I have more time currently and looking to take another stab at this over the next few weeks. — Jeff M, Oct 25 '17 at 20:12
I helped create [smart_importer](https://github.com/beancount/smart_importer), it achieves the same goal, implemented for beancount, and I also used scikit-learn. — Johannes, May 01 '18 at 19:24
yes thanks @Johannes , I have been looking at the smart_importer and like the implementation and goals with integrating with beancount and fava. I have just reached out to you a couple of days ago on smart_importer git as I have not had any success yet getting it to work in beancount/fava stack. — Jeff M, May 02 '18 at 20:55
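
The composite-label idea raised in the comments above could be sketched roughly as follows. This is only an illustration under assumptions: the separator SEP is a hypothetical choice (it must never occur inside a payee or account name), and the training rows are a subset of the question's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Subset of the training data from the question.
descriptions = [
    "POS MERCHANDISE TIM HORTONS #57",
    "POS MERCHANDISE LCBO/RAO #0266",
    "LOBLAWS 1035 NEPEAN ON",
    "FARM BOY #90 NEPEAN ON",
    "COSTCO GAS W1263 NEPEAN ON",
    "POS MERCHANDISE STARBUCKS #456",
]
payees = ["TIM HORTONS", "LCBO", "LOBLAWS", "FARM BOY", "COSTCO GAS", "STARBUCKS"]
accounts = [
    "Expenses:Food:Dinning",
    "Expenses:Food:Alcohol-tobacco",
    "Expenses:Food:Groceries",
    "Expenses:Food:Groceries",
    "Expenses:Auto:Gas",
    "Expenses:Food:Dinning",
]

# Hypothetical separator; assumed never to appear in a payee or account name.
SEP = " | "

# One class per (payee, account) pair that actually occurs in the ledger.
composite_labels = [payee + SEP + account
                    for payee, account in zip(payees, accounts)]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(descriptions, composite_labels)

# Every prediction is a pair seen in training, so splitting it always
# yields exactly one payee and one account.
results = []
for desc in ["STARBUCKS #589", "POS COSTCO GAS"]:
    label = clf.predict([desc])[0]
    payee, account = label.split(SEP, 1)
    results.append((payee, account))
    print("%s => %s, %s" % (desc, payee, account))
```

The trade-off is the one already mentioned: the number of classes grows with every new payee/account pair, and a pair that has never been seen together can never be predicted.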

Dadep · Answer 1 · 2017-02-24T13:06:43.353

About your last comment:

 Training set------->  1st classifier  <------- new data input
                           |                
                           |            
                           | 
                    output labelled data (payee)
                           +              
                     other features
                           |
New training set---> 2d classifier   
                           |
                           |
                    output labelled data (ledger account)

Or

Training set------->  1st classifier  <------- new data input
                           |                
                           |            
                           | 
                         output :
       multi-labelled data (payee and ledger account)

EDIT: With 2 independent classifiers:

Training set for payee (with all relevant features)
            |
            |
      1st classifier <--------new input data
            |
            | 
       output labelled data (payee)


Training set for ledger account (with all relevant features)
            |
            |
      2d classifier <--------new input data
            |
            | 
       output labelled data (ledger account)
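
As a rough illustration of the two independent classifiers drawn above, the following sketch trains one pipeline per target on the same descriptions. The choice of TfidfVectorizer with LinearSVC simply mirrors the question's script and is an assumption, not a recommendation; the data is a subset of the question's training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Subset of the training data from the question.
descriptions = [
    "POS MERCHANDISE TIM HORTONS #57",
    "POS MERCHANDISE LCBO/RAO #0266",
    "LOBLAWS 1035 NEPEAN ON",
    "FARM BOY #90 NEPEAN ON",
    "COSTCO GAS W1263 NEPEAN ON",
    "POS MERCHANDISE STARBUCKS #456",
]
payees = ["TIM HORTONS", "LCBO", "LOBLAWS", "FARM BOY", "COSTCO GAS", "STARBUCKS"]
accounts = [
    "Expenses:Food:Dinning",
    "Expenses:Food:Alcohol-tobacco",
    "Expenses:Food:Groceries",
    "Expenses:Food:Groceries",
    "Expenses:Auto:Gas",
    "Expenses:Food:Dinning",
]

# 1st classifier: description -> payee (ordinary multi-class problem).
payee_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
payee_clf.fit(descriptions, payees)

# 2nd classifier: description -> ledger account (also multi-class).
account_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
account_clf.fit(descriptions, accounts)

new_descriptions = ["STARBUCKS #589", "POS COSTCO GAS"]
predicted_payees = payee_clf.predict(new_descriptions)
predicted_accounts = account_clf.predict(new_descriptions)

# Each classifier returns exactly one label per sample, so every new
# transaction gets one payee and one account.
for desc, payee, account in zip(new_descriptions, predicted_payees,
                                predicted_accounts):
    print("%s => %s, %s" % (desc, payee, account))
```

Because each classifier is a plain multi-class model, there is never any doubt about which output is the payee and which is the account, but nothing ties the two predictions together.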

This is the exact part that I am confused about... once the payee has been predicted, an account should be predicted, but the predicted account should only be one that has previously been seen with the predicted payee (a dependency). I want to say it's more like your first drawing, but is there a way to limit what the second classifier is able to predict? — Jeff M, Feb 23 '17 at 22:12
The question is how the pieces of information depend on one another: can payee and ledger account be predicted from the same features? If yes, the second drawing is better. If not: does the ledger account strongly depend on the result of the payee classification? If yes, the first drawing could work; as an alternative, or if not, the third drawing could be better (see the edit on my post). — Dadep, Feb 24 '17 at 12:58
(If you use the first drawing, you'll need to "manually" create the 2nd training set.) — Dadep, Feb 24 '17 at 13:08
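
One possible way to limit the second prediction to accounts already seen with the predicted payee is sketched below. Note that this is a swapped-in variant rather than the first drawing itself: the payee is not fed to the second classifier as a feature but is used afterwards to filter its candidates. The function predict_entry and the lookup seen_accounts are hypothetical names, and LogisticRegression is used for the account model only because it exposes per-class probabilities.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Subset of the training data from the question.
descriptions = [
    "POS MERCHANDISE TIM HORTONS #57",
    "POS MERCHANDISE LCBO/RAO #0266",
    "LOBLAWS 1035 NEPEAN ON",
    "FARM BOY #90 NEPEAN ON",
    "COSTCO GAS W1263 NEPEAN ON",
    "COSTCO WHOLESALE W1263 NEPEAN ON",
    "POS MERCHANDISE STARBUCKS #456",
]
payees = ["TIM HORTONS", "LCBO", "LOBLAWS", "FARM BOY",
          "COSTCO GAS", "COSTCO", "STARBUCKS"]
accounts = [
    "Expenses:Food:Dinning",
    "Expenses:Food:Alcohol-tobacco",
    "Expenses:Food:Groceries",
    "Expenses:Food:Groceries",
    "Expenses:Auto:Gas",
    "Expenses:Food:Groceries",
    "Expenses:Food:Dinning",
]

# Which accounts has each payee been booked to before?
seen_accounts = defaultdict(set)
for payee, account in zip(payees, accounts):
    seen_accounts[payee].add(account)

payee_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
payee_clf.fit(descriptions, payees)

account_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
account_clf.fit(descriptions, accounts)


def predict_entry(description):
    """Predict a payee, then the most probable account seen with that payee."""
    payee = payee_clf.predict([description])[0]
    probabilities = account_clf.predict_proba([description])[0]
    allowed = seen_accounts[payee]
    # Keep only accounts previously used with this payee, then take the
    # one the account classifier considers most likely.
    candidates = [(prob, str(acct))
                  for acct, prob in zip(account_clf.classes_, probabilities)
                  if acct in allowed]
    account = max(candidates, key=lambda c: c[0])[1]
    return str(payee), account


payee, account = predict_entry("POS COSTCO GAS")
print("%s, %s" % (payee, account))
```

With this filter the account classifier can never suggest an account that contradicts the ledger history for the predicted payee, at the cost of inheriting any mistake made by the payee classifier.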
