-2

enter image description here

I have some item codes (the numbers) on a receipt in pdf format, like above. How can I read the codes with a program? Thanks.

There aren't special characters marking the codes. Text formatting in each pdf is not the same. Converted pdf to word and to text file.

JW JW
  • 1

1 Answers1

0

This is a duplicate of How can I read pdf in python? .

Reading text with Python is not complicated. Based on the description you have given above, it is difficult to understand the scale of variations (in text being read) you are trying to cover.

Reading text is half the job. Understanding it and putting a context to read text is another job. It has many intermediate steps, depending upon the complexity you are trying to target.

If you could define the patterns you are trying to read using regular expressions, this job can be tackled without going into deeper and more complex areas such as natural language processing and machine learning.

Here is an example of regex for the string you supplied, note that patterns may not match as I have defined the regex rules based on certain criteria which may not be what you are looking for, however this should demonstrate how regular expression can help parsing structured strings :

string data : 885911248648 DW 1/4 Bit 4.78 regular expression : \d\d\d\d\d\d\d\d\d\d\d\d\sDW\s\d/\d\sBit \s\s\s\s\s\s\s\s\s\s\s\d.\d\d application in python :

    import re
    def use_regex(input_text):
    pattern = re.compile(r"\d\d\d\d\d\d\d\d\d\d\d\d\sDW\s\d/\d\sBit <A>\s\s\s\s\s\s\s\s\s\s\s\d\.\d\d", re.IGNORECASE)
    return pattern.match(input_text)

RegEx patterns generated using this tool.

Amogh Sarpotdar
  • 544
  • 4
  • 15