Get data from receipt using regular expressions

Question

I am using regular expressions to get each line item's data from a receipt. The receipts are going to look like this:

Qty Desc
1   JD *#
    MARTINI *#   
2   XXXXXX 
3   YYYYYY
4   JD
    PEPSI *#

All items have quantities and descriptions, and some of them have an extra *#. Also, note that the descriptions can have spaces in them, and even more than one line, each line being able to have its own *#. I want to catch the quantity and description (if more than one line, get all lines), and I do not care at all about the extra *#. So in this example, for the first line item I would catch Quantity=1 and Description="JD MARTINI". For the fourth, Quantity=4 and Description="JD PEPSI".

My current regular expression looks like this:

((\d+)\s+(.*)(\s+\*#)?)

It is not working, and I assume that is because making the last group optional lets the greedy (.*) consume everything up to the end of the line, including the *#. If the last group weren't optional, the regular expression would do its job for the line items with the extra *#, but it wouldn't match the items that don't have it (the second and third ones in the example above).
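
The behaviour described above can be reproduced with a small console sketch. It runs the pattern from the question against a single line and prints what the description group captures. The class and method names (`GreedyDemo`, `CapturedDescription`) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyDemo
{
    // The pattern from the question, unchanged.
    const string Pattern = @"((\d+)\s+(.*)(\s+\*#)?)";

    // Returns the text captured by the description group (group 3).
    public static string CapturedDescription(string line)
    {
        Match m = Regex.Match(line, Pattern);
        return m.Groups[3].Value;
    }

    static void Main()
    {
        // The greedy (.*) runs to the end of the line, so the optional
        // group has nothing left to match and the *# stays in the capture.
        Console.WriteLine("[" + CapturedDescription("1   JD *#") + "]");  // prints [JD *#]
    }
}
```

Because the trailing group is optional, the engine never has to backtrack out of the `*#`, which is why it ends up inside the description.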

Any ideas?

Are your descriptions all solid text, or do they have spaces in them? — Ann L., Dec 11 '12 at 23:33
You might want to try the regex test harness at regexlib.com, BTW. I've found it very helpful. — Ann L., Dec 11 '12 at 23:35
Sorry for forgetting to mention that. They do have spaces in them. — erictrigo, Dec 11 '12 at 23:36

score 1 · Accepted Answer · edited May 23 '17 at 11:48

After reading your modified question, I have determined that what you wish to accomplish cannot be done with one regular expression. You will have to do a combination of regex match + replace. (see this question: Regular expression to skip character in capture group)

Match Regex: (\d+)\s+([A-Z\s*#]*[A-Z]+)

Replace Regex: (\*#(\s*))|(\r\n\s+)(?=\s)

The match regex will match the quantity and the item description, including any in-between line breaks or *# occurrences, but leaving out the final *#. It assumes that a description contains only uppercase letters, whitespace and *#, and that its last character is a letter.

Running the match regex gives you a collection of matches, which you will need to iterate through and turn into objects; the code below does that. For each object, run the replace regex on its description to remove the *# markers and the extraneous whitespace. Note that the replace regex expects Windows-style (\r\n) line breaks.

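    // Requires: using System.Linq; using System.Text.RegularExpressions;

    // One parsed line item from the receipt.
    class ReceiptItem
    {
        public int Quantity { get; set; }
        public string Description { get; set; }

        // Used as the display text when the item is added to the ListBox.
        public override string ToString()
        {
            return string.Format("{0}\t{1}", Quantity, Description);
        }
    }

    private void button1_Click(object sender, EventArgs e)
    {
        // Group 1 is the quantity; group 2 is the raw description, which may span lines and contain *#.
        var matches = Regex.Matches(textBox1.Text, @"(\d+)\s+([A-Z\s\*\#]*[A-Z]+)", RegexOptions.Multiline);
        var items = (from Match m in matches
                     select new ReceiptItem()
                                {
                                    Quantity = int.Parse(m.Groups[1].Value),
                                    // Strip each *# marker and the line break plus indentation between lines.
                                    Description = Regex.Replace(m.Groups[2].Value, @"(\*\#(\s*))|(\r\n\s+)(?=\s)", "")
                                });

        listBox1.Items.AddRange(items.ToArray());
    }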
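
For checking the two patterns outside a Windows Forms project, here is a console sketch that applies the same match and replace steps to the sample receipt from the question. The `ReceiptDemo` class and its `Parse` method are hypothetical names. The input is assembled with explicit `\r\n` line breaks, since the replace pattern relies on them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ReceiptDemo
{
    // Both patterns are taken from the answer above.
    const string MatchPattern = @"(\d+)\s+([A-Z\s\*\#]*[A-Z]+)";
    const string CleanPattern = @"(\*\#(\s*))|(\r\n\s+)(?=\s)";

    // Returns (quantity, cleaned description) pairs in receipt order.
    public static List<KeyValuePair<int, string>> Parse(string receipt)
    {
        var items = new List<KeyValuePair<int, string>>();
        foreach (Match m in Regex.Matches(receipt, MatchPattern))
        {
            int quantity = int.Parse(m.Groups[1].Value);
            string description = Regex.Replace(m.Groups[2].Value, CleanPattern, "");
            items.Add(new KeyValuePair<int, string>(quantity, description));
        }
        return items;
    }

    static void Main()
    {
        // Explicit \r\n so the input does not depend on the source file's line endings.
        string receipt = string.Join("\r\n",
            "Qty Desc",
            "1   JD *#",
            "    MARTINI *#   ",
            "2   XXXXXX ",
            "3   YYYYYY",
            "4   JD",
            "    PEPSI *#");

        foreach (var item in Parse(receipt))
            Console.WriteLine(item.Key + "\t" + item.Value);
    }
}
```

On this input the sketch should print four lines: `1 JD MARTINI`, `2 XXXXXX`, `3 YYYYYY` and `4 JD PEPSI`, each with a tab between quantity and description. Descriptions containing digits or lowercase letters would need a wider character class than `[A-Z\s\*\#]`.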

It doesn't do exactly what I need, but it's my fault for not giving a better and more explained example of what I'm trying to accomplish. Please take a look at the edited question. — erictrigo, Dec 11 '12 at 23:54

score 0 · Answer 2 · answered Dec 11 '12 at 23:37

Try this regex (with Multiline option):

(\d+)\s+(?:(.*)(?:\s+\*#)|([^#]*))$

answered by manji

It may be because I'm using The Regex Coach, but it doesn't match anything. – erictrigo Dec 11 '12 at 23:42
I tried it on this page: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx – manji Dec 11 '12 at 23:45
It's matching in Regex Coach. Have you checked the multiline box? – manji Dec 11 '12 at 23:54
If you use this as target string: "1 Example One\n\n1 Example Two *#\n Test 1 *#\n1 Example Three *#\nTest 2 *#", it will not match "1 Example Two *#", and it will capture a few extra spaces with "1 Example One". – erictrigo Dec 12 '12 at 14:29

score 0 · Answer 3 · answered Dec 11 '12 at 23:50

Try this out. I think it does what you need.

((\d+)\s+(.+?)(\s+\*#)*)

answered by Jake

It only gets the first character of each line item's description. – erictrigo Dec 11 '12 at 23:56
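
The single-character result comes from the lazy `.+?` having nothing after it that is required to match, so it stops as early as it can. The sketch below reproduces that and compares it with an anchored variant. The anchored pattern is an assumption of this example, not something proposed in the thread, and it only handles descriptions that fit on one line. `LazyDemo` and `Describe` are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyDemo
{
    // Pattern from this answer: nothing after the lazy group is required.
    const string Unanchored = @"((\d+)\s+(.+?)(\s+\*#)*)";

    // Hypothetical variant: the anchors force the lazy group to grow until
    // only an optional *# and trailing whitespace remain on the line.
    const string Anchored = @"^(\d+)\s+(.+?)(?:\s+\*#)?\s*$";

    // Returns the value of the given capture group for the first match.
    public static string Describe(string pattern, int group, string line)
    {
        return Regex.Match(line, pattern).Groups[group].Value;
    }

    static void Main()
    {
        Console.WriteLine(Describe(Unanchored, 3, "2   XXXXXX"));   // prints X
        Console.WriteLine(Describe(Anchored, 2, "2   XXXXXX "));    // prints XXXXXX
        Console.WriteLine(Describe(Anchored, 2, "1   JD *#"));      // prints JD
    }
}
```

To apply the anchored variant to a whole receipt, it would have to be run line by line or with `RegexOptions.Multiline`, and continuation lines such as `MARTINI *#` would still need separate handling.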
