How to split abbreviation expansions written without spaces into words

Question

I have a list of (key, value) pairs:

lst = [
  ('AAI', 'AirportAuthorityofIndia'),
  ('AAO', 'AssistantAccountsOfficer'),
  ('AB', 'AutonomousBodies'),
  ('ABA', 'AntiBoostMissile'),
  ('ABC', 'AuditBureauofCirculation'),
  ('ABM', 'AntiBallisticMissile'),
  ('ABVP', 'AkhilBharatiyaVidyarthiParishad'),
  ('AC', 'AssistantCollector'),
  ('AC', 'AirConditioner'),
  ('ACL', 'AccessControlList'),
  ('ACT', 'AssociationofComputerTechnology')]

I want to add spaces between the words in the values. For example, I need to split:

('AAI', 'AirportAuthorityofIndia') into ('AAI', 'Airport Authority of India')

('ACT', 'AssociationofComputerTechnology') into ('ACT', 'Association of Computer Technology')

If the words were marked only by capital letters, I could do it with a regular expression:

[(abbr, re.sub(r'([a-z])(?=[A-Z])', r'\1 ', long)) for abbr, long in lst]

and I get

[('AAI', 'Airport Authorityof India')....etc

How do I add a space before the lowercase words as well?

Or is there any other method I can use to do this?

The `of` here is problematic, because otherwise you could split on capital letters. How do you expect to know whether `of` is its own word or part of a word like `roof`? — jordanm, Jun 18 '20 at 15:22
This is the problem I am having as well. The word can be anything, not just 'of'. It could also be something like 'in'. So I have no clue how to split all the attached words. — Anish Krishnan, Jun 18 '20 at 15:27

score 0 · Answer 1 · answered Jun 19 '20 at 01:51

I wrote the code below. I hope it helps.

`[A-Z]{1}` matches exactly one capital letter, and `[a-z]+` matches the one or more lowercase letters that follow it.

import re

lst = [
    ('AAI', 'AirportAuthorityofIndia'),
    ('AAO', 'AssistantAccountsOfficer'),
    ('AB', 'AutonomousBodies'),
    ('ABA', 'AntiBoostMissile'),
    ('ABC', 'AuditBureauofCirculation'),
    ('ABM', 'AntiBallisticMissile'),
    ('ABVP', 'AkhilBharatiyaVidyarthiParishad'),
    ('AC', 'AssistantCollector'),
    ('AC', 'AirConditioner'),
    ('ACL', 'AccessControlList'),
    ('ACT', 'AssociationofComputerTechnology')]

for abbr, long in lst:
    # Each match is one capital letter plus the lowercase letters after it.
    result = re.findall(r'[A-Z]{1}[a-z]+', long)
    print(abbr, ",", " ".join(result))
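
The regex above splits only at capital letters, so a lowercase word such as `of` stays attached to the word before it (`Authorityof`). Below is a minimal sketch of one possible workaround, assuming the lowercase connecting words come from a small hand-picked set. The names `split_words` and `CONNECTORS` are hypothetical and not part of the answer. As the comment on the question points out, this heuristic cannot tell a real `of` from the ending of a word such as `Proof`.

```python
import re

# Hypothetical, hand-picked set of lowercase connecting words (an assumption).
CONNECTORS = ("of",)

def split_words(text, connectors=CONNECTORS):
    """Split CamelCase text, then peel a trailing connector off each chunk.

    Assumes `text` contains only ASCII letters and starts with a capital.
    """
    chunks = re.findall(r'[A-Z][a-z]*', text)
    words = []
    for i, chunk in enumerate(chunks):
        is_last = i == len(chunks) - 1
        for conn in connectors:
            stem = chunk[:-len(conn)]
            # Peel only when another word follows and a plausible stem remains.
            if not is_last and chunk.endswith(conn) and len(stem) >= 3:
                words.extend([stem, conn])
                break
        else:
            # No connector matched: keep the chunk as a single word.
            words.append(chunk)
    return " ".join(words)

print(split_words('AirportAuthorityofIndia'))
print(split_words('AssociationofComputerTechnology'))
```

With the default set, `split_words('AirportAuthorityofIndia')` returns `'Airport Authority of India'`, but `split_words('ProofReader')` returns `'Pro of Reader'`, so this is only usable when the input is known not to contain such words.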

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

A solution, although far from ideal.

It does not use the abbreviations. It is based on this SO answer and uses its infer_spaces(s) function, which relies on a word dictionary file, words-by-frequency.txt.

In your case you have to add native Indian words such as Akhil, Bharatiya, Vidyarthi and Parishad to your own copy of the word dictionary file to make it fully generic. That is what I did below.

    lst = [
      ('AAI', 'AirportAuthorityofIndia'),
      ('AAO', 'AssistantAccountsOfficer'),
      # ...
    ]

    for (abbr, long) in lst:
        print(infer_spaces(long.lower()).title())

Outputs:

Airport Authority Of India
Assistant Accounts Officer
Autonomous Bodies
Anti Boost Missile
Audit Bureau Of Circulation
Anti Ballistic Missile
Akhil Bharatiya Vidyarthi Parishad
Assistant Collector
Air Conditioner
Access Control List
Association Of Computer Technology
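
For readers without the linked answer at hand, here is a minimal sketch of the kind of dictionary-based segmentation that an `infer_spaces`-style function performs. It is not the linked answer's code: the function name `infer_spaces_sketch`, the tiny inline `WORDS` list (standing in for words-by-frequency.txt) and the `UNKNOWN_COST` penalty are all assumptions made for illustration. The idea is dynamic programming: every known word gets a cost that grows with its rank in the frequency list, and the cheapest way to cut the string into words wins.

```python
from math import log

# Hypothetical stand-in for words-by-frequency.txt, most frequent word first.
WORDS = ["of", "in", "india", "air", "airport", "authority", "association",
         "computer", "technology", "access", "control", "list"]

# Zipf-style cost: words further down the list are assumed rarer and cost more.
WORD_COST = {w: log((rank + 1) * log(len(WORDS))) for rank, w in enumerate(WORDS)}
MAX_WORD_LEN = max(len(w) for w in WORDS)
UNKNOWN_COST = 1e9  # assumed penalty for any piece that is not in WORDS

def infer_spaces_sketch(s):
    """Return `s` (lowercase, no spaces) cut into the cheapest word sequence."""
    n = len(s)
    cost = [0.0] * (n + 1)    # cost[i]: cheapest segmentation of s[:i]
    last_len = [0] * (n + 1)  # last_len[i]: length of the last word in it
    for i in range(1, n + 1):
        best = None
        for k in range(1, min(i, MAX_WORD_LEN) + 1):
            c = cost[i - k] + WORD_COST.get(s[i - k:i], UNKNOWN_COST)
            if best is None or c < best:
                best, last_len[i] = c, k
        cost[i] = best
    # Walk backwards through the recorded word lengths to recover the words.
    words = []
    i = n
    while i > 0:
        k = last_len[i]
        words.append(s[i - k:i])
        i -= k
    return " ".join(reversed(words))

print(infer_spaces_sketch("airportauthorityofindia").title())
```

Because the input is lowercased first and title-cased afterwards, connecting words come out capitalized (`Of`), as in the output listed above. Words missing from the list, such as the Indian names, would need to be added to it.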
