Extract address from business card with maximum probability

Question

I have an image of a business card. Using OCR I can convert the image to text. Now I want to separate the pieces of information and add them to a contact.

With regex I can parse fields such as phone, email, and website, but I fail to isolate the address because its format varies from card to card.

I am using Firebase ML Kit on-device on Android. The OCR output is attached below.

Input image: a business card taken from Google Images.

The OCR output is:

Line 1 = [larriS, Insurance]
Line 2 = [A, Legacy, of, Quality, Service]
Line 3 = [Wayne, Stansfield,, i, CLCS]
Line 4 = [1380, Rio, Rancho, Blvd, SE363]
Line 5 = [Rio, Rancho,, NM, 87124]
Line 6 = [CELL, 505.554.0510]
Line 7 = [PHONE, 505-818-9377]
Line 8 = [FAX, 888-753.4449]
Line 9 = [WayneJames@me.com]

I checked link1, link2 and link3 but could not find the address with a regex, so I tried to find it indirectly.

If the card has a postal code, I try to locate the address through it, but postal code formats vary too. Using multiple regexes for different countries gave some hope, but it is not a real solution. Can you please help me find a way to extract the address? I understand that nothing will work 100% for every format on the market, but I want to cover as many formats as possible.

Here is a reference application that can do this:

CardCam Application Business Card Reader Free - Business Card Scanner

There are card-reading APIs as well, but they are all paid:

Abbyy CardCam API

Do the business cards that you have to analyze always have the same format? Are they always from "Harris Insurance"? - Salvatore, Aug 14 '18 at 08:58
No. It is an example. I know that it varies from card to card; that's why I want to parse most formats, not all. - Nil, Aug 14 '18 at 09:12
If a commercial solution would be feasible: as the Abbyy API will be discontinued in 2024, I just wanted to point to the API solution we offer. Disclaimer: I work for the company :) https://www.snapaddy.com/en/blog/post/business-card-scanning-api-alternative-for-abbyy-cloud-ocr-sdk.html - Sebastian Metzger, Aug 30 '23 at 12:27

score 4 · Accepted Answer · edited Aug 26 '19 at 09:44

You extract the information line by line and already recognize some of the lines: in your example, lines 6-8 are recognized as phone numbers, and line 9 can be classified as an email.

So the only doubt is about lines 1-5.
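
This pre-filtering step can be sketched as follows. It is only an illustration: the class name `LineFilter` and the `EMAIL` and `PHONE` patterns are assumptions, not part of the question, and the patterns are deliberately loose (the phone pattern only covers the 3-3-4 digit layout seen on this card).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LineFilter {
    // Hypothetical, deliberately loose patterns; tune them for your own cards.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern PHONE =
            Pattern.compile("\\(?\\d{3}\\)?[ .-]?\\d{3}[ .-]?\\d{4}");

    /** Returns true if the joined line contains an email address or a phone number. */
    public static boolean isRecognized(List<String> tokens) {
        String line = String.join(" ", tokens);
        return EMAIL.matcher(line).find() || PHONE.matcher(line).find();
    }

    /** Keeps only the lines that still have to be checked for an address. */
    public static List<List<String>> addressCandidates(List<List<String>> lines) {
        List<List<String>> candidates = new ArrayList<>();
        for (List<String> tokens : lines) {
            if (!isRecognized(tokens)) {
                candidates.add(tokens);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<List<String>> lines = Arrays.asList(
                Arrays.asList("1380", "Rio", "Rancho", "Blvd", "SE363"),
                Arrays.asList("CELL", "505.554.0510"),
                Arrays.asList("FAX", "888-753.4449"),
                Arrays.asList("WayneJames@me.com"));
        // Only the lines that are neither phone nor email remain.
        System.out.println(addressCandidates(lines));
    }
}
```

Whatever survives this filter is what the address heuristics need to look at, which keeps phone numbers from being mistaken for postal codes.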

You can never be 100% sure whether a line is an address, because there is no 'protocol' for how an address should be printed on a card. So you can only make assumptions:

1. The address is most likely on line 2 or later, because in most cases the first line holds the company name.
2. Part of the address should contain predefined values, e.g.
- Blvd
- st.
- street
- [XX] (a two-letter state code)
- Zip: a regex for a zip code is quite simple
- other keywords
3. The address will most likely begin with a zip code.

If you combine all of this into a single approach, you get an algorithm that can estimate the probability that a line is an address.

Under the assumptions above, lines 4 and 5 are the most likely address lines because:
- Line 4 starts with a number that looks like a zip code.
- Line 5 contains something that looks like a state.

UPDATE

A more complete solution could look like this:

// Assumed declarations (not shown in the original snippet); both are filled by init().
// Requires java.util.ArrayList, java.util.HashMap, java.util.List and java.util.Map.
private static final Map<String, String> zipRegexps = new HashMap<>();   // country code -> postal code regex
private static final List<String> addressKeywords = new ArrayList<>();  // street keywords, lower case

public static float checkLineForAddress(List<String> testdata) {
    boolean containsZip = false;
    boolean containsState = false;
    boolean containsAddressKeyword = false;
    boolean containsWord = false;
    boolean containsCapitalizedWord = false;
    boolean containsNumber = false;
    boolean containsBuildingNum = false;
    for (String item : testdata) {
        // A token counts as a zip code if any country's pattern matches it entirely.
        for (String zipRegex : zipRegexps.values()) {
            containsZip = containsZip || item.matches(zipRegex);
            if (containsZip) break;
        }
        // Two upper-case letters look like a state abbreviation, e.g. "NM".
        containsState = containsState || item.matches("[A-Z]{2}");
        // A slash hints at a building/apartment number such as "12/3".
        containsBuildingNum = containsBuildingNum || item.contains("/");
        containsWord = containsWord || item.matches("[A-Za-z]+");
        containsCapitalizedWord = containsCapitalizedWord || item.matches("[A-Z]+[a-z]+");
        // Compare against the keywords without dots, so "st." equals "st".
        for (String addressKeyword : addressKeywords) {
            containsAddressKeyword = containsAddressKeyword || item.replace(".", "").equalsIgnoreCase(addressKeyword);
        }
        containsNumber = containsNumber || item.matches("[0-9]+");
    }

    // Strong evidence: zip code, a capitalized word, and a state or street keyword.
    if (containsZip && containsCapitalizedWord && (containsState || containsAddressKeyword)) return 1f;

    // Otherwise add up the weaker hints.
    float addressProbability = 0;
    if (containsZip && containsWord) addressProbability = 0.5f;
    if (containsCapitalizedWord) addressProbability += 0.1f;
    if (containsAddressKeyword) addressProbability += 0.2f;
    if (containsNumber) addressProbability += 0.05f;
    if (containsBuildingNum) addressProbability += 0.05f;
    if (testdata.size() > 1) addressProbability += 0.05f;
    if (testdata.size() > 2) addressProbability += 0.05f;
    return addressProbability;
}

I've taken the list of postal code patterns from the question "What is the ultimate postal code and zip regex?". The init method that fills the variables:

private static void init() {
        zipRegexps.put("GB", "GIR[ ]?0AA|((AB|AL|B|BA|BB|BD|BH|BL|BN|BR|BS|BT|CA|CB|CF|CH|CM|CO|CR|CT|CV|CW|DA|DD|DE|DG|DH|DL|DN|DT|DY|E|EC|EH|EN|EX|FK|FY|G|GL|GY|GU|HA|HD|HG|HP|HR|HS|HU|HX|IG|IM|IP|IV|JE|KA|KT|KW|KY|L|LA|LD|LE|LL|LN|LS|LU|M|ME|MK|ML|N|NE|NG|NN|NP|NR|NW|OL|OX|PA|PE|PH|PL|PO|PR|RG|RH|RM|S|SA|SE|SG|SK|SL|SM|SN|SO|SP|SR|SS|ST|SW|SY|TA|TD|TF|TN|TQ|TR|TS|TW|UB|W|WA|WC|WD|WF|WN|WR|WS|WV|YO|ZE)(\\d[\\dA-Z]?[ ]?\\d[ABD-HJLN-UW-Z]{2}))|BFPO[ ]?\\d{1,4}");
        zipRegexps.put("JE", "JE\\d[\\dA-Z]?[ ]?\\d[ABD-HJLN-UW-Z]{2}");
        zipRegexps.put("GG", "GY\\d[\\dA-Z]?[ ]?\\d[ABD-HJLN-UW-Z]{2}");
        zipRegexps.put("IM", "IM\\d[\\dA-Z]?[ ]?\\d[ABD-HJLN-UW-Z]{2}");
        zipRegexps.put("US", "\\d{5}([ \\-]\\d{4})?");
        zipRegexps.put("CA", "[ABCEGHJKLMNPRSTVXY]\\d[ABCEGHJ-NPRSTV-Z][ ]?\\d[ABCEGHJ-NPRSTV-Z]\\d");
        zipRegexps.put("DE", "\\d{5}");
        zipRegexps.put("JP", "\\d{3}-\\d{4}");
        zipRegexps.put("FR", "\\d{2}[ ]?\\d{3}");
        zipRegexps.put("AU", "\\d{4}");
        zipRegexps.put("IT", "\\d{5}");
        zipRegexps.put("CH", "\\d{4}");
        zipRegexps.put("AT", "\\d{4}");
        zipRegexps.put("ES", "\\d{5}");
        zipRegexps.put("NL", "\\d{4}[ ]?[A-Z]{2}");
        zipRegexps.put("BE", "\\d{4}");
        zipRegexps.put("DK", "\\d{4}");
        zipRegexps.put("SE", "\\d{3}[ ]?\\d{2}");
        zipRegexps.put("NO", "\\d{4}");
        zipRegexps.put("BR", "\\d{5}[\\-]?\\d{3}");
        zipRegexps.put("PT", "\\d{4}([\\-]\\d{3})?");
        zipRegexps.put("FI", "\\d{5}");
        zipRegexps.put("AX", "22\\d{3}");
        zipRegexps.put("KR", "\\d{3}[\\-]\\d{3}");
        zipRegexps.put("CN", "\\d{6}");
        zipRegexps.put("TW", "\\d{3}(\\d{2})?");
        zipRegexps.put("SG", "\\d{6}");
        zipRegexps.put("DZ", "\\d{5}");
        zipRegexps.put("AD", "AD\\d{3}");
        zipRegexps.put("AR", "([A-HJ-NP-Z])?\\d{4}([A-Z]{3})?");
        zipRegexps.put("AM", "(37)?\\d{4}");
        zipRegexps.put("AZ", "\\d{4}");
        zipRegexps.put("BH", "((1[0-2]|[2-9])\\d{2})?");
        zipRegexps.put("BD", "\\d{4}");
        zipRegexps.put("BB", "(BB\\d{5})?");
        zipRegexps.put("BY", "\\d{6}");
        zipRegexps.put("BM", "[A-Z]{2}[ ]?[A-Z0-9]{2}");
        zipRegexps.put("BA", "\\d{5}");
        zipRegexps.put("IO", "BBND 1ZZ");
        zipRegexps.put("BN", "[A-Z]{2}[ ]?\\d{4}");
        zipRegexps.put("BG", "\\d{4}");
        zipRegexps.put("KH", "\\d{5}");
        zipRegexps.put("CV", "\\d{4}");
        zipRegexps.put("CL", "\\d{7}");
        zipRegexps.put("CR", "\\d{4,5}|\\d{3}-\\d{4}");
        zipRegexps.put("HR", "\\d{5}");
        zipRegexps.put("CY", "\\d{4}");
        zipRegexps.put("CZ", "\\d{3}[ ]?\\d{2}");
        zipRegexps.put("DO", "\\d{5}");
        zipRegexps.put("EC", "([A-Z]\\d{4}[A-Z]|(?:[A-Z]{2})?\\d{6})?");
        zipRegexps.put("EG", "\\d{5}");
        zipRegexps.put("EE", "\\d{5}");
        zipRegexps.put("FO", "\\d{3}");
        zipRegexps.put("GE", "\\d{4}");
        zipRegexps.put("GR", "\\d{3}[ ]?\\d{2}");
        zipRegexps.put("GL", "39\\d{2}");
        zipRegexps.put("GT", "\\d{5}");
        zipRegexps.put("HT", "\\d{4}");
        zipRegexps.put("HN", "(?:\\d{5})?");
        zipRegexps.put("HU", "\\d{4}");
        zipRegexps.put("IS", "\\d{3}");
        zipRegexps.put("IN", "\\d{6}");
        zipRegexps.put("ID", "\\d{5}");
        zipRegexps.put("IL", "\\d{5}");
        zipRegexps.put("JO", "\\d{5}");
        zipRegexps.put("KZ", "\\d{6}");
        zipRegexps.put("KE", "\\d{5}");
        zipRegexps.put("KW", "\\d{5}");
        zipRegexps.put("LA", "\\d{5}");
        zipRegexps.put("LV", "\\d{4}");
        zipRegexps.put("LB", "(\\d{4}([ ]?\\d{4})?)?");
        zipRegexps.put("LI", "(948[5-9])|(949[0-7])");
        zipRegexps.put("LT", "\\d{5}");
        zipRegexps.put("LU", "\\d{4}");
        zipRegexps.put("MK", "\\d{4}");
        zipRegexps.put("MY", "\\d{5}");
        zipRegexps.put("MV", "\\d{5}");
        zipRegexps.put("MT", "[A-Z]{3}[ ]?\\d{2,4}");
        zipRegexps.put("MU", "(\\d{3}[A-Z]{2}\\d{3})?");
        zipRegexps.put("MX", "\\d{5}");
        zipRegexps.put("MD", "\\d{4}");
        zipRegexps.put("MC", "980\\d{2}");
        zipRegexps.put("MA", "\\d{5}");
        zipRegexps.put("NP", "\\d{5}");
        zipRegexps.put("NZ", "\\d{4}");
        zipRegexps.put("NI", "((\\d{4}-)?\\d{3}-\\d{3}(-\\d{1})?)?");
        zipRegexps.put("NG", "(\\d{6})?");
        zipRegexps.put("OM", "(PC )?\\d{3}");
        zipRegexps.put("PK", "\\d{5}");
        zipRegexps.put("PY", "\\d{4}");
        zipRegexps.put("PH", "\\d{4}");
        zipRegexps.put("PL", "\\d{2}-\\d{3}");
        zipRegexps.put("PR", "00[679]\\d{2}([ \\-]\\d{4})?");
        zipRegexps.put("RO", "\\d{6}");
        zipRegexps.put("RU", "\\d{6}");
        zipRegexps.put("SM", "4789\\d");
        zipRegexps.put("SA", "\\d{5}");
        zipRegexps.put("SN", "\\d{5}");
        zipRegexps.put("SK", "\\d{3}[ ]?\\d{2}");
        zipRegexps.put("SI", "\\d{4}");
        zipRegexps.put("ZA", "\\d{4}");
        zipRegexps.put("LK", "\\d{5}");
        zipRegexps.put("TJ", "\\d{6}");
        zipRegexps.put("TH", "\\d{5}");
        zipRegexps.put("TN", "\\d{4}");
        zipRegexps.put("TR", "\\d{5}");
        zipRegexps.put("TM", "\\d{6}");
        zipRegexps.put("UA", "\\d{5}");
        zipRegexps.put("UY", "\\d{5}");
        zipRegexps.put("UZ", "\\d{6}");
        zipRegexps.put("VA", "00120");
        zipRegexps.put("VE", "\\d{4}");
        zipRegexps.put("ZM", "\\d{5}");
        zipRegexps.put("AS", "96799");
        zipRegexps.put("CC", "6799");
        zipRegexps.put("CK", "\\d{4}");
        zipRegexps.put("RS", "\\d{6}");
        zipRegexps.put("ME", "8\\d{4}");
        zipRegexps.put("CS", "\\d{5}");
        zipRegexps.put("YU", "\\d{5}");
        zipRegexps.put("CX", "6798");
        zipRegexps.put("ET", "\\d{4}");
        zipRegexps.put("FK", "FIQQ 1ZZ");
        zipRegexps.put("NF", "2899");
        zipRegexps.put("FM", "(9694[1-4])([ \\-]\\d{4})?");
        zipRegexps.put("GF", "9[78]3\\d{2}");
        zipRegexps.put("GN", "\\d{3}");
        zipRegexps.put("GP", "9[78][01]\\d{2}");
        zipRegexps.put("GS", "SIQQ 1ZZ");
        zipRegexps.put("GU", "969[123]\\d([ \\-]\\d{4})?");
        zipRegexps.put("GW", "\\d{4}");
        zipRegexps.put("HM", "\\d{4}");
        zipRegexps.put("IQ", "\\d{5}");
        zipRegexps.put("KG", "\\d{6}");
        zipRegexps.put("LR", "\\d{4}");
        zipRegexps.put("LS", "\\d{3}");
        zipRegexps.put("MG", "\\d{3}");
        zipRegexps.put("MH", "969[67]\\d([ \\-]\\d{4})?");
        zipRegexps.put("MN", "\\d{6}");
        zipRegexps.put("MP", "9695[012]([ \\-]\\d{4})?");
        zipRegexps.put("MQ", "9[78]2\\d{2}");
        zipRegexps.put("NC", "988\\d{2}");
        zipRegexps.put("NE", "\\d{4}");
        zipRegexps.put("VI", "008(([0-4]\\d)|(5[01]))([ \\-]\\d{4})?");
        zipRegexps.put("PF", "987\\d{2}");
        zipRegexps.put("PG", "\\d{3}");
        zipRegexps.put("PM", "9[78]5\\d{2}");
        zipRegexps.put("PN", "PCRN 1ZZ");
        zipRegexps.put("PW", "96940");
        zipRegexps.put("RE", "9[78]4\\d{2}");
        zipRegexps.put("SH", "(ASCN|STHL) 1ZZ");
        zipRegexps.put("SJ", "\\d{4}");
        zipRegexps.put("SO", "\\d{5}");
        zipRegexps.put("SZ", "[HLMS]\\d{3}");
        zipRegexps.put("TC", "TKCA 1ZZ");
        zipRegexps.put("WF", "986\\d{2}");
        zipRegexps.put("XK", "\\d{5}");
        zipRegexps.put("YT", "976\\d{2}");

        addressKeywords.add("blvd");
        addressKeywords.add("st");
        addressKeywords.add("street");
        addressKeywords.add("lane");
    }
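
One possible refinement, sketched here under stated assumptions: `String.matches` compiles its regex on every call, so with this many patterns it may be worth compiling them once. Several entries in the list (for example BH, BB, HN and NG) are entirely optional and therefore match an empty string, so the sketch also skips empty tokens. The class `ZipMatcher` and the method `firstMatchingCountry` are hypothetical names, and only four of the patterns are repeated to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class ZipMatcher {
    // Compiled once at class load; a small subset of the patterns from init().
    private static final Map<String, Pattern> ZIP_PATTERNS = new LinkedHashMap<>();

    static {
        ZIP_PATTERNS.put("US", Pattern.compile("\\d{5}([ \\-]\\d{4})?"));
        ZIP_PATTERNS.put("CA", Pattern.compile(
                "[ABCEGHJKLMNPRSTVXY]\\d[ABCEGHJ-NPRSTV-Z][ ]?\\d[ABCEGHJ-NPRSTV-Z]\\d"));
        ZIP_PATTERNS.put("BM", Pattern.compile("[A-Z]{2}[ ]?[A-Z0-9]{2}"));
        ZIP_PATTERNS.put("BH", Pattern.compile("((1[0-2]|[2-9])\\d{2})?"));
    }

    /** Returns the country code of the first pattern matching the whole token, or null. */
    public static String firstMatchingCountry(String token) {
        if (token.isEmpty()) {
            return null; // fully optional patterns such as BH would match ""
        }
        for (Map.Entry<String, Pattern> entry : ZIP_PATTERNS.entrySet()) {
            if (entry.getValue().matcher(token).matches()) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatchingCountry("87124"));
        System.out.println(firstMatchingCountry("CLCS"));
        System.out.println(firstMatchingCountry("Rio"));
    }
}
```

Returning the country code instead of a boolean is a design choice of this sketch: it lets you see which pattern fired, which helps when a token such as CLCS is matched by a postal code format you do not expect on your cards.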

The test data:

List<String> testdata = new ArrayList<>();
testdata.add("1380");
testdata.add("Rio");
testdata.add("Rancho");
testdata.add("Blvd");
testdata.add("SE363");
Log.e("!@#", String.valueOf(checkLineForAddress(testdata)));
testdata = new ArrayList<>();
testdata.add("Rio");
testdata.add("Rancho");
testdata.add("NM");
testdata.add("87124");
Log.e("!@#", String.valueOf(checkLineForAddress(testdata)));
testdata = new ArrayList<>();
testdata.add("Wayne");
testdata.add("Stansfield");
testdata.add("i");
testdata.add("CLCS");
Log.e("!@#", String.valueOf(checkLineForAddress(testdata)));
testdata = new ArrayList<>();
testdata.add("James");
testdata.add("Gordon");
testdata.add("Smith");
Log.e("!@#", String.valueOf(checkLineForAddress(testdata)));
testdata = new ArrayList<>();
testdata.add("5052");
testdata.add("554");
testdata.add("11500");
testdata.add("121151");
Log.e("!@#", String.valueOf(checkLineForAddress(testdata)));
testdata = new ArrayList<>();
testdata.add("Creative");
testdata.add("Director");
Log.e("!@#", String.valueOf(checkLineForAddress(testdata)));

And the output:

E/!@#: 1.0
E/!@#: 1.0
E/!@#: 0.70000005
E/!@#: 0.2
E/!@#: 0.15
E/!@#: 0.15

As you can see, the 3rd line is scored as an address with 70% probability, because CLCS could theoretically be a Bermuda postal code.

You can adjust the probabilities to fit your test data.
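
Once every line has a score, one way to build the final address is to keep the lines that reach a threshold and join them. The sketch below is an assumption about how that could be done, not part of the method above: `AddressPicker`, `pickAddress` and the threshold value are hypothetical, and the scorer is passed in as a function so that `checkLineForAddress` (or any tuned variant) can be plugged in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class AddressPicker {
    /**
     * Joins every line whose score reaches the threshold into one address string.
     * The scorer stands in for checkLineForAddress; the threshold is a tunable assumption.
     */
    public static String pickAddress(List<List<String>> lines,
                                     ToDoubleFunction<List<String>> scorer,
                                     double threshold) {
        List<String> parts = new ArrayList<>();
        for (List<String> tokens : lines) {
            if (scorer.applyAsDouble(tokens) >= threshold) {
                parts.add(String.join(" ", tokens));
            }
        }
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        List<List<String>> lines = Arrays.asList(
                Arrays.asList("Wayne", "Stansfield", "i", "CLCS"),
                Arrays.asList("1380", "Rio", "Rancho", "Blvd", "SE363"),
                Arrays.asList("Rio", "Rancho", "NM", "87124"));
        // Stub scorer for the demo only; it mirrors the scores printed above
        // (0.7 for the name line, 1.0 for the two address lines).
        ToDoubleFunction<List<String>> stub =
                tokens -> (tokens.contains("Blvd") || tokens.contains("NM")) ? 1.0 : 0.7;
        System.out.println(pickAddress(lines, stub, 0.9));
    }
}
```

With a threshold of 0.9 the name line (0.7) is dropped and only the two address lines are kept. Lowering the threshold trades missed addresses for more false positives, so it is worth choosing it from a set of real cards.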

Thank you for breaking it down in a simple way. The information you sum up in the 2nd point is really impressive. As I mentioned in the question, I already tried following the zip code too, so I don't want to track the address on the basis of the postal code alone. I want to find a more accurate way to do so. - Nil, Aug 14 '18 at 09:39
The zip code is only one part of the solution; check my update. - dilix, Aug 15 '18 at 08:41
@dillix, Awesome work man :) I am testing this algorithm with the formats I most want to cover and will update you soon. - Nil, Aug 16 '18 at 08:56
@dillix, I tested this algo and it worked like a charm. To support the rest of the formats, I will customize it. Thank you for your time and effort and for showing me a way to achieve it. (y) - Nil, Aug 17 '18 at 06:18
@Upendra the whole methods are described above in the answer, so you just need to copy/paste them and adjust them to your needs. - dilix, Sep 28 '18 at 11:33
