
Newlines appear inconsistently in the input, so I cannot use them as a record delimiter. The incoming text has the following format:

IDNumber FirstName LastName Score Letter Location

  • IDNumber: 9 digits
  • Score: 0-100
  • Letter: A or B
  • Location: Optional. Could be anything from an abbreviated state name to a city and state fully spelled out.

Ex:

123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR

The elements would be:

123456789 John Doe 90 A New York City
987654321 Jane Doe 70 B CAL
432167895 John Cena 60 B FL
473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR

I need to access each field individually for each person. For the John Cena object, for example, I need to be able to access the ID (432167895), the first name (John), the last name (Cena), and the letter (B). I don’t really need the location, but it will be part of the input.

Edit: It is worth mentioning that I am not allowed to import any modules, such as the one for regular expressions.

  • If the input is a string, I would start by [splitting the string on whitespace characters](http://stackoverflow.com/questions/8113782/split-string-on-whitespace-in-python). – Anderson Green Apr 19 '17 at 21:00

5 Answers


You could use a regular expression that requires each record to start with a 9-digit number, keeps multi-word last names together, and skips the location:

import re

# data holds the whole input text as one string
res = re.findall(r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)", data)

Result is:

[('123456789', 'John', 'Doe', '90', 'A'), 
 ('987654321', 'Jane', 'Doe', '70', 'B'), 
 ('432167895', 'John', 'Cena', '60', 'B'), 
 ('473829105', 'Donald', 'Trump', '70', 'E'), 
 ('098743215', 'Bernie', 'Sanders', '92', 'A')]
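
Since the question asks for access to individual fields, the tuples above could be wrapped in a small record type. The following is only a sketch: the `Person` type, its field names, and the `by_id` dictionary are illustrative choices, not part of the question.

```python
import re
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
Person = namedtuple("Person", ["id", "first_name", "last_name", "score", "letter"])

data = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

pattern = r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)"

# Each tuple returned by findall unpacks into one Person
people = [Person(*fields) for fields in re.findall(pattern, data)]

# Index the records by ID so a single person can be looked up directly
by_id = {person.id: person for person in people}

cena = by_id["432167895"]
print(cena.first_name, cena.last_name, cena.letter)  # John Cena B
```

All fields stay strings here; convert the score with int() if you need to compare it numerically.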
trincot

Since splitting on whitespace is not helpful for identification of the location, I would directly go for a regex:

import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

search_string = re.compile(r"([0-9]{9})\W+([a-zA-Z ]+)\W+([a-zA-Z ]+)\W+([0-9]{1,3})\W+([AB])\W+([a-zA-Z ]+)\W+")
person_list = search_string.findall(input_string)

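This yields only three of the five records. The Donald Trump record is skipped because its letter E does not match [AB], and the last record is skipped because nothing follows its location to satisfy the trailing \W+: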

[('123456789', 'John', 'Doe', '90', 'A', 'New York City'),
 ('987654321', 'Jane', 'Doe', '70', 'B', 'CAL'),
 ('432167895', 'John', 'Cena', '60', 'B', 'FL')]

Explanation of the groups in the regex:

  • ID: 9 digits (followed by at least one non-word character, such as whitespace)
  • first and last name: 2 separate groups of letters divided by at least one non-word character (followed by at least one non-word character)
  • Score: one, two or three digits (followed by at least one non-word character)
  • Letter: A or B (followed by at least one non-word character)
  • Location: a group of letters and spaces, required by this pattern (followed by at least one non-word character)
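
If the location really is optional and the last record has no trailing whitespace, a variant of the pattern could treat the location as zero or more words after the letter. This is a sketch under assumptions that go beyond the answer above: names are single alphabetic words, the letter may be any uppercase letter (the sample contains an E), and `record_re` is a name invented here.

```python
import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

# Assumed variant: the last group collects zero or more alphabetic words,
# so records without a location (or at the end of the input) still match.
record_re = re.compile(
    r"(\d{9})\s+([A-Za-z]+)\s+([A-Za-z]+)\s+(\d{1,3})\s+([A-Z])((?:\s+[A-Za-z]+)*)"
)

# Normalise the whitespace inside the captured location
person_list = [
    fields[:5] + (" ".join(fields[5].split()),)
    for fields in record_re.findall(input_string)
]

for person in person_list:
    print(person)
```

With the sample input this returns all five records, with an empty string as the location of the Donald Trump record.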
Christian König

Since you know the ID number is going to be at the start of each "record" and is 9 digits long, try splitting on the 9-digit ID number:

# Assuming your file is read in as a string s:
import re

# Split on any whitespace run (spaces or newlines) that precedes a 9-digit ID
records = re.split(r'\s+(?=[0-9]{9}\b)', s.strip())

# record_locator will end up holding your records as:
# {'<full name>': {'ID': <ID value>, 'FirstName': <FirstName value>, 'LastName': <LastName value>, 'Letter': <Letter value>}, '<full name 2>': {...}, ...}
record_locator = {}

field_names = ['ID', 'FirstName', 'LastName', 'Letter']

# Get the individual records and store their values:
for record in records:

    # split() with no argument also handles newlines inside a record
    values = record.split()[:5]

    # Discard the int after the name, e.g. 90 in the first record
    del values[3]

    # Create a new entry for the full name. This will overwrite entries with the same name, so you might want to use a unique ID instead
    record_locator[values[1] + ' ' + values[2]] = dict(zip(field_names, values))

Then to access the information:

print(record_locator['John Doe']['ID'])  # 123456789
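
As the comment in the loop suggests, keying by ID avoids overwriting people who share a name. A minimal sketch of that variant follows. It assumes every record has at least the five leading fields, keeps the score instead of deleting it, and splits on any whitespace run (\s+) so that newlines in front of an ID also count.

```python
import re

s = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

field_names = ['ID', 'FirstName', 'LastName', 'Score', 'Letter']

record_locator = {}
for record in re.split(r'\s+(?=[0-9]{9}\b)', s.strip()):
    # The first five whitespace-separated tokens are the fixed fields;
    # anything after them is the optional location and is ignored here.
    values = record.split()[:5]
    record_locator[values[0]] = dict(zip(field_names, values))

print(record_locator['987654321']['FirstName'])  # Jane
```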
sgrg

I think splitting on the 9-digit number might be the best option.

import re

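with open('data.txt') as f:
    data = f.read()

# Capture each chunk that starts with a 9-digit ID and runs lazily up to the next ID;
# the final record is left over as the last piece of the split
results = re.split(r'(\d{9}[\s\S]*?(?=[0-9]{9}))', data)
# Drop the empty strings that re.split leaves between adjacent matches
results = list(filter(None, results))
print(results)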

This gave me the following results:

['123456789 John Doe 90 A New York City ', '987654321\nJane Doe 70 B CAL ', '432167895 John\n\nCena 60 B FL ', '473829105 Donald Trump 70 E\n', '098743215 Bernie Sanders 92 A AR']
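
Each chunk is still one raw string, so a further step is needed to reach the individual fields. One possible follow-up, sketched here with dictionary keys chosen only for illustration, is to split every chunk on whitespace and treat whatever follows the fifth token as the location:

```python
results = ['123456789 John Doe 90 A New York City ',
           '987654321\nJane Doe 70 B CAL ',
           '432167895 John\n\nCena 60 B FL ',
           '473829105 Donald Trump 70 E\n',
           '098743215 Bernie Sanders 92 A AR']

people = []
for chunk in results:
    # split() with no argument treats spaces and newlines alike
    fields = chunk.split()
    id_number, first_name, last_name, score, letter = fields[:5]
    people.append({
        'id': id_number,
        'first_name': first_name,
        'last_name': last_name,
        'score': score,
        'letter': letter,
        # Any remaining tokens form the optional location
        'location': ' '.join(fields[5:]),
    })

print(people[2]['first_name'], people[2]['last_name'])  # John Cena
```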
davidejones

There is probably a more elegant way to do this, but here is an idea based on an example input string.

text = "123456789 John Doe 90 A New York City 987654321 Jane Doe 70 B CAL 473829105 Donald Trump 70 E 098743215 Bernie Sanders 92 A AR"

# Split on any run of whitespace (spaces, tabs, newlines)
tokens = text.split()

# Store the records in a dictionary; this could then be dumped to a JSON file
data = {'output': []}
end = len(tokens)

i = 0

while i < end:
    # The first five tokens of a record are always present
    tmp = {
        'id': tokens[i],
        'fname': tokens[i + 1],
        'lname': tokens[i + 2],
        'score': tokens[i + 3],
        'letter': tokens[i + 4],
    }
    i += 5

    # Everything up to the next all-digit token (the next ID) is the optional location;
    # the bounds check replaces catching an IndexError at the end of the list
    location = []
    while i < end and not tokens[i].isdigit():
        location.append(tokens[i])
        i += 1

    tmp['location'] = ' '.join(location)
    data['output'].append(tmp)

print(data)
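
The same idea can be packaged as a function that needs no imports and looks records up by ID. This is a sketch rather than a definitive solution: `is_id` and `parse_records` are names made up here, an ID is assumed to be any token of exactly nine digits (so a shorter numeric token in a location is not mistaken for one), and the score is assumed to be a valid integer.

```python
def is_id(token):
    """Return True if the token looks like a 9-digit ID number."""
    return len(token) == 9 and token.isdigit()


def parse_records(text):
    """Group whitespace-separated tokens into one dict per person, keyed by ID."""
    groups = []
    for token in text.split():
        if is_id(token):
            groups.append([token])    # a new record starts at every ID
        elif groups:
            groups[-1].append(token)  # other tokens belong to the current record

    people = {}
    for fields in groups:
        if len(fields) < 5:
            continue  # incomplete record; skipped in this sketch
        people[fields[0]] = {
            'id': fields[0],
            'fname': fields[1],
            'lname': fields[2],
            'score': int(fields[3]),
            'letter': fields[4],
            'location': ' '.join(fields[5:]),
        }
    return people


sample = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

people = parse_records(sample)
cena = people['432167895']
print(cena['fname'], cena['lname'], cena['letter'])  # John Cena B
```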
N Cheadle