0

I get this text from a pdf invoice:

INVOCE DATE            Nº ITEM          CONTRACT DATA 

10/10/15           EN56000004567WWG      Standard Plan 3

  CONCEPT        AMOUNT       MONTHS   UNITPRIZE     PRIZE

CONCEPT AAA    47,101   MB      1,0    3,394074   159,86   Dollars
CONCEPT BBB    26,122   MB      1,0    3,394074    88,66   Dollars
CONCEPT CCC    37,101   MB      1,0    3,394074   125,92   Dollars

                       TOTAL       374,44 Dollars

This text is actually a table with several lines but only one colunm where data is in fact only separated with a diferent number of whitespaces in almost every line.

What I want is to get the amounts "47,101" , "26,122", "37,101" with a specific regex for each one based on their concept, for example: regex1 gets "47,101" looking for "CONCEPT AAA" and so on.

I have achieved to get "CONCEPT AAA 47,101" ​​using this R line:

regmatches(invoice,regexpr("\\bCONCEPT AAA\\s*([-,0-9]+)", invoice, perl=TRUE))

but I only want the number "47,101".

ADDITIONAL INFO

For read the pdf I use readPDF function from tm package in R which outputs this table which indeed it is a character vector.

Due to there are a lot of invoices with slight differences in disposition I prefer use regex way to get data rather than try a best pdf to table conversion.

BONUS:

Then I will would like to get the prices for each concept "159,86", "88,66", "125,92".

Pin
  • 155
  • 1
  • 11
  • 1
    You could also consider splitting your data to columns (using read.delim or something), if they're all this highly organized. – Heroka Nov 12 '15 at 12:03
  • 2
    Can you add a `dput(invoice)`? – David Arenburg Nov 12 '15 at 12:06
  • Your regex is correct. You can extract the number part "47,101" by select group 1 from your regex result. Have you tried that? – Hp93 Nov 12 '15 at 12:13
  • Hi @Heroka, I am reading the text from a pdf (readPDF) , this text is only a part from a very unstructured invoice. – Pin Nov 12 '15 at 12:23
  • @Pin that makes sense. Could you add a dput of your data to your question? Makes it easier to tinker/test solutions. – Heroka Nov 12 '15 at 12:24
  • Hi @Hp93, what i get with this regex is a string "CONCEPT AAA 47,101" not a vector whith 2 variables like "CONCEPT AAA" and "47,101", in this case, as you comment, i could get the price adding [2] at the end of r expression. – Pin Nov 12 '15 at 12:27
  • Sorry I don't know R. You can take a look at this [link](http://stackoverflow.com/questions/952275/regex-group-capture-in-r-with-multiple-capture-groups) to learn how to get groups from regex result. Tell me if you find your answer there. – Hp93 Nov 12 '15 at 13:08
  • @Heroka I have just updated my question with more info as you suggest. I also have to edit dput because it's too large and there are like one hundred spaces before data in each line. – Pin Nov 12 '15 at 16:56
  • @Hp93, I have already seen that answer, but I would like to do it with only regular expresions (without using more fuctions) because in my code regexs are a variables, so I can't include functions in it. Something like `regmatches(invoice,regexpr(regex[i], invoice, perl=TRUE))` – Pin Nov 12 '15 at 17:06
  • If there were a fixed number of characters (i.e., no arbitrary amount of whitespace) you could use a lookbehind assertion. Would you be willing to pre-process your text replacing arbitrary whitespace `"\s+"` with a single space `" "`? – Gregor Thomas Nov 12 '15 at 17:30

1 Answers1

0

If you'd be willing to replacem arbitrary whitespace with a single space, then you can use lookbehind. This could be done inline, but it's a little messy:

regmatches(
    x = gsub("\\s+", replacement = " ", x = invoice),
    m = regexpr(
        "(?<=\\bCONCEPT AAA\\s)([-,0-9]+)",
        gsub("\\s+", replacement = " ", x = invoice),
        perl = TRUE
    )
)
# [1] "47,101"

Seems a little more straightforward if you pre-process:

invoice_onespace = gsub("\\s+", replacement = " ", x = invoice)
regmatches(
    x = invoice_onespace ,
    m = regexpr(
        "(?<=\\bCONCEPT AAA\\s)([-,0-9]+)",
        invoice_onespace ,
        perl = TRUE
    )
)

Since the whole point of this seems to be that you can use the matching string as a variable, it should be hard to pull the concepts out with something like "\\bCONCEPT\\b(\\s[A-Za-z]*)*\\s(?=[0-9])" and paste together an appropriate match string like above using the concept extractions in the lookbehind assertion as a method to get all the concepts. I can't take a better guess at the exact regex needed without having a lot more information about what possible values for "concept" are - I'm assuming they're not all 3 capital letter strings as in your example. If they are, then "\\bCONCEPT\\b\\s[A-Z]{3}\\b" should work.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Good one!, great simplification. You are right, concepts are very diferent like "Concept One" , "Service Two" , "Tax" , "Discounts" now question is what i call in the question "Bonus" : How can I get rest of data? like the "UNITPRIZE" for each Concept/Service/Tax? – Pin Nov 12 '15 at 19:07
  • Hopefully you can take this answer and generalize. However, if you want the full table the easy way is to use some sort of table conversion. If you "prefer to use regex", then you should acknowledge that you're recreating this from scratch in regex and it will take a fair bit of work. Alternatively, you could trim the first and last lines off of your string and then use `read.table()`. – Gregor Thomas Nov 12 '15 at 19:11