Python - How to split text input into separate elements

Question

The input has inconsistent newlines, so I cannot use newlines as a delimiter. The incoming text will be in the following format:

IDNumber FirstName LastName Score Letter Location

IDNumber: 9 numbers

Score: 0-100

Letter: A or B

Location: Could be anything from abbreviated State name to a City and State fully spelled out. This is optional.

Ex:

123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR

The elements would be:

123456789 John Doe 90 A New York City
987654321 Jane Doe 70 B CAL
432167895 John Cena 60 B FL
473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR

I need to access each field of each person individually. For the John Cena record, for example, I would need to be able to access the ID (432167895), the first name (John), the last name (Cena), and the letter (B). I don’t really need the location, but it will be part of the input.

Edit: It is worth mentioning that I am not allowed to import any modules, such as regular expressions.

If the input is a string, I would start by [splitting the string on whitespace characters](http://stackoverflow.com/questions/8113782/split-string-on-whitespace-in-python). — Anderson Green, Apr 19 '17 at 21:00
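A minimal sketch of what that whitespace split gives you, using part of the sample input from the question (the variable names `text` and `tokens` are just for illustration). Calling `split()` with no argument treats any run of spaces or newlines as a single separator, so the irregular line breaks stop mattering:

```python
# Part of the sample input from the question, with its irregular line breaks
text = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL"""

# split() with no argument splits on any run of whitespace, newlines included
tokens = text.split()

print(tokens[:5])     # the first five tokens of the first record
print(tokens[15:17])  # a name that was broken across lines in the input
```

The records still have to be regrouped afterwards, since the token list no longer says where one person ends and the next begins.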

score 0 · Answer 1 · answered Apr 19 '17 at 21:17

You could use a regular expression, which would require each record to start with a 9-digit number, taking words together where necessary, and skipping the location:

import re

# data is assumed to hold the whole input as a single string
res = re.findall(r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)", data)

Result is:

[('123456789', 'John', 'Doe', '90', 'A'), 
 ('987654321', 'Jane', 'Doe', '70', 'B'), 
 ('432167895', 'John', 'Cena', '60', 'B'), 
 ('473829105', 'Donald', 'Trump', '70', 'E'), 
 ('098743215', 'Bernie', 'Sanders', '92', 'A')]
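To get at the fields by name rather than by position, the tuples can be zipped with a list of field names. This is only a sketch on top of the regex above: the names in `field_names` and the `people` list are my own choices, not part of the answer.

```python
import re

data = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

res = re.findall(r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)", data)

# Hypothetical field names, one per capture group in the pattern
field_names = ['ID', 'FirstName', 'LastName', 'Score', 'Letter']

# One dictionary per person, so fields can be looked up by name
people = [dict(zip(field_names, row)) for row in res]

cena = people[2]
print(cena['ID'], cena['FirstName'], cena['LastName'], cena['Letter'])
```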

score 0 · Answer 2 · answered Apr 19 '17 at 21:20

Since splitting on whitespace is not helpful for identification of the location, I would directly go for a regex:

import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

search_string=re.compile(r"([0-9]{9})\W+([a-zA-Z ]+)\W+([a-zA-Z ]+)\W+([0-9]{1,3})\W+([AB])\W+([a-zA-Z ]+)\W+")
person_list = re.findall(search_string, input_string)

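This yields the following. Only records with an A or B letter and a location followed by at least one more character are matched, so the last two records of the sample are skipped: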

[('123456789', 'John', 'Doe', '90', 'A', 'New York City'),
 ('987654321', 'Jane', 'Doe', '70', 'B', 'CAL'),
 ('432167895', 'John', 'Cena', '60', 'B', 'FL')]

Explanation of the groups in the regex:

ID: 9 digits (followed by at least one whitespace)
first and last name: 2 separate groups of characters divided by at least one whitespace (followed by at least one whitespace)
Score: one, two or three digits (followed by at least one whitespace)
Letter: A or B (followed by at least one whitespace)
Location: a group of characters (followed by at least one whitespace)
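If records without a location should be picked up as well, one option is to make the last group optional. The pattern below is my own variation, not the answerer's code, and it rests on some assumptions: first and last names are single tokens, the letter is any single capital letter, and location words contain only letters.

```python
import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

# Hypothetical variation: the location is zero or more letter-only words,
# so a record may end right after the letter
pattern = re.compile(
    r"(\d{9})\s+(\S+)\s+(\S+)\s+(\d{1,3})\s+([A-Z])((?:\s+[A-Za-z]+)*)"
)

person_list = [
    # Collapse the whitespace captured inside the location group
    fields[:5] + (" ".join(fields[5].split()),)
    for fields in pattern.findall(input_string)
]

for person in person_list:
    print(person)
```

A location containing digits or punctuation (for example "St. Louis") would be cut short by this pattern, so treat it as a starting point only.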

score 0 · Answer 3 · answered Apr 19 '17 at 21:26

Since you know the ID number is going to be at the start of each "record" and is 9 digits long, try splitting by the 9-digit id-number:

# Assuming your file is read in as a string s:
import re

# Split on any whitespace (spaces or newlines) that comes right before a 9-digit ID
records = re.split(r'\s+(?=[0-9]{9}\b)', s.strip())

# record_locator will end up holding your records as: {'<full name>' -> {'ID'-><ID value>, 'FirstName'-><FirstName value>, 'LastName'-><LastName value>, 'Letter'-><LetterValue>}, 'full name 2'->{...} ...}
record_locator = {}

field_names = ['ID', 'FirstName', 'LastName', 'Letter']

# Get the individual records and store their values:
for record in records:

    # split() with no argument splits on any run of whitespace, so stray newlines inside a record are handled
    values = record.split()[:5]

    # Discard the score after the name, e.g. 90 in the first record
    del values[3]

    # Create a new entry keyed by the full name. This will overwrite entries with the same name, so you might want to use the unique ID instead
    record_locator[values[1] + ' ' + values[2]] = dict(zip(field_names, values))

Then to access the information:

print(record_locator['John Doe']['ID'])  # 123456789

score 0 · Answer 4 · answered Apr 19 '17 at 21:31

I think trying to split by the 9 digit number might be the best option.

import re

with open('data.txt') as f:
    data = f.read()
    results = re.split(r'(\d{9}[\s\S]*?(?=[0-9]{9}))', data)
    results = list(filter(None, results))
    print(results)

This gave me these results:

['123456789 John Doe 90 A New York City ', '987654321\nJane Doe 70 B CAL ', '432167895 John\n\nCena 60 B FL ', '473829105 Donald Trump 70 E\n', '098743215 Bernie Sanders 92 A AR']
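Each chunk still needs to be broken into fields. A rough sketch of that step, with the list above hard-coded so it runs on its own; the helper `parse_chunk` and the dictionary keys are hypothetical names of mine. It assumes the first five tokens are always ID, first name, last name, score and letter, and that anything left over is the location.

```python
# Chunks as printed in the answer above (hard-coded so the sketch is self-contained)
results = ['123456789 John Doe 90 A New York City ',
           '987654321\nJane Doe 70 B CAL ',
           '432167895 John\n\nCena 60 B FL ',
           '473829105 Donald Trump 70 E\n',
           '098743215 Bernie Sanders 92 A AR']


def parse_chunk(chunk):
    """Turn one record chunk into a dictionary of named fields."""
    parts = chunk.split()  # splits on spaces and newlines alike
    return {
        'id': parts[0],
        'first': parts[1],
        'last': parts[2],
        'score': int(parts[3]),
        'letter': parts[4],
        # Whatever follows the letter is the optional location
        'location': ' '.join(parts[5:]),
    }


people = [parse_chunk(chunk) for chunk in results]
print(people[2]['id'], people[2]['first'], people[2]['last'], people[2]['letter'])
```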

N Cheadle · Accepted Answer · 2017-04-19T23:22:24.090

There is probably a more elegant way to do this, but here is an idea based on an example input string.

# named text rather than input, to avoid shadowing the built-in input()
text = "123456789 John Doe 90 A New York City 987654321 Jane Doe 70 B CAL 473829105 Donald Trump 70 E 098743215 Bernie Sanders 92 A AR"

# split on whitespace (spaces and newlines alike)
tokens = text.split()

# store the records in a dictionary; this could then be dumped to a JSON file
data = {'output': []}
end = len(tokens)

i = 0

while i < end:
    tmp = {}
    tmp['id'] = tokens[i]
    i = i + 1
    tmp['fname'] = tokens[i]
    i = i + 1
    tmp['lname'] = tokens[i]
    i = i + 1
    tmp['score'] = tokens[i]
    i = i + 1
    tmp['letter'] = tokens[i]
    i = i + 1
    location = ""
    # Catch the IndexError raised when the last record runs off the end of the list
    try:
        # an all-digit token marks the ID of the next record
        is_id = tokens[i].isdigit()
        while not is_id:
            location = location + " " + tokens[i]
            i = i + 1
            is_id = tokens[i].isdigit()
    except IndexError:
        print('Completed Array')

    # strip the leading space; location is an empty string when none was given
    tmp['location'] = location.strip()
    data['output'].append(tmp)

print(data)

This works perfectly except when the location is not specified! Do you know how to fix it? The location element is optional. — Jackson Blankenship, Apr 19 '17 at 22:54
I did an update that just puts an empty string in location if nothing is there. — N Cheadle, Apr 19 '17 at 23:23
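For anyone who wants the same no-import idea wrapped in a function, here is a rough sketch. It is not the accepted answer's code: `is_id` and `parse_records` are names I made up, and it assumes that any token of exactly nine digits starts a new record, so a location containing a nine-digit number would confuse it.

```python
def is_id(token):
    # Assumption: only ID numbers are exactly nine digits long
    return len(token) == 9 and token.isdigit()


def parse_records(text):
    """Group whitespace-separated tokens into one dictionary per person."""
    records = []
    current = None
    for token in text.split():
        if is_id(token):
            # A new ID starts a new record
            current = [token]
            records.append(current)
        elif current is not None:
            current.append(token)

    people = []
    for rec in records:
        if len(rec) < 5:
            continue  # skip incomplete records rather than guess at them
        people.append({
            'id': rec[0],
            'fname': rec[1],
            'lname': rec[2],
            'score': int(rec[3]),
            'letter': rec[4],
            # Empty string when the optional location is missing
            'location': ' '.join(rec[5:]),
        })
    return people


sample = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

for person in parse_records(sample):
    print(person['id'], person['fname'], person['lname'], person['letter'])
```

Grouping on the ID first means a missing location needs no special case: the record simply has no tokens left after the letter.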
