
I'm trying to parse out line items from text extracted from a PDF. The text extracted comes out poorly formatted and in one long string per page. There aren't any useful delimiters, but the lines start with one of two strings. I've set up the Split() using a string array with both of those strings, but I need to know which delimiter the elements were split on.

I found this link, but I'm not that great at RegEx. Can someone assist in writing the RegEx string?

    var lineItems = page.PageText.Split(new string[] { "First String Delimiter", "Second String Delimiter" }, StringSplitOptions.None);

What I need to know is whether element[x] was the result of splitting on "First String Delimiter" or "Second String Delimiter".

EDIT: I don't care if Regex is the solution. LINQ may be equally suited. LINQ didn't come out until after I earned my degrees, so I'm similarly unfamiliar with it.

Imagine a page with about 15-20 of these end to end, coming back as one long string with no carriage returns. Since they all start with "Corporate Trade Payment Credit" or "Preauthorized ACH Credit", I can split on those, but I need to know which type each one was.

Preauthorized ACH Credit (165) 10,000.00 489546541 0000000000 Text Some long description about transaction- Preauthorized ACH Credit (165) 5,310.99 8465498461 0000000000 Text Another long description Corporate Trade Payment Credit (165) 4,933.17 8478632458775 0000000000 Text Another confidential string description.


2 Answers

Why don't you just run the split twice, once with the first delimiter, then again with the second delimiter?

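    // The string[] overload is used here because Split(string) is not available on .NET Framework.
    var firstDelimiterItems = page.PageText.Split(new[] { "First String Delimiter" }, StringSplitOptions.None);

    var secondDelimiterItems = page.PageText.Split(new[] { "Second String Delimiter" }, StringSplitOptions.None);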
Garrison Neely
  • That may be what I'm forced to do, but I would like, if possible, to have each line item in its own element from the start. – Mike Evering Jul 02 '13 at 21:19

Sometimes the simplest solutions are the best ones. Don't know why this didn't occur to me earlier.

    var pageText = page.PageText.Replace("Corporate Trade Payment", "\r\nCorporate Trade Payment").Replace("Preauthorized ACH Credit", "\r\nPreauthorized ACH Credit");

This gives me the line items on their own lines. No Regex needed. Thank you all for your help, and if you find a way to answer the original question with Regex, please post it. I'm always up for learning more.
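For anyone who lands here wanting the Regex route to the original question, here is a minimal sketch rather than a definitive answer. It assumes the two delimiters are exactly "Corporate Trade Payment Credit" and "Preauthorized ACH Credit", and that any text before the first delimiter can be ignored. The `LineItemParser` class, the `Parse` method, and the `type`/`body` group names are made up for illustration. Each match captures the delimiter it started on, then lazily consumes text until a lookahead sees the next delimiter or the end of the string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative, not from the original post.
public static class LineItemParser
{
    // "type" captures which delimiter started the item; "body" lazily takes
    // everything up to (but not including) the next delimiter or end of input.
    private static readonly Regex ItemPattern = new Regex(
        @"(?<type>Corporate Trade Payment Credit|Preauthorized ACH Credit)" +
        @"(?<body>.*?)" +
        @"(?=Corporate Trade Payment Credit|Preauthorized ACH Credit|\z)",
        RegexOptions.Singleline);

    // Returns one (delimiter, remaining text) pair per line item, in page order.
    public static List<KeyValuePair<string, string>> Parse(string pageText)
    {
        var items = new List<KeyValuePair<string, string>>();
        foreach (Match m in ItemPattern.Matches(pageText))
        {
            items.Add(new KeyValuePair<string, string>(
                m.Groups["type"].Value,
                m.Groups["body"].Value.Trim()));
        }
        return items;
    }
}
```

Calling `LineItemParser.Parse(page.PageText)` would then give one element per line item, with `Key` holding the delimiter it was split on and `Value` holding the rest of the line. If a description could itself contain one of the delimiter phrases, this sketch would split there too, the same as the Replace approach above.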