2

I'm not very experienced with complicated large-scale parsing in Python, do you guys have any tips or guides on how to easily parse multiple text files with different formats, and combining them into a single .csv file and ultimately entering them into a database?

An example of the text files is as follows:

general.txt (Name -- Department (DEPT) Room # [Age]

John Doe -- Management (MANG) 205 [Age: 40]
Equipment: Laptop, Desktop, Printer, Stapler
Experience: Python, Java, HTML
Description: Hardworking, awesome

Mary Smith -- Public Relations (PR) 605 [Age: 24] 
Equipment: Mac, PC
Experience: Social Skills
Description: fun to be around

Scott Lee -- Programmer (PG) 403 [Age: 25]
Equipment: Personal Computer
Experience: HTML, CSS, JS
Description: super-hacker

Susan Kim -- Programmer (PG) 504 [Age: 21]
Equipment: Desktop
Experience: Social Skills
Descriptions: fun to be around

Bob Simon  -- Programmer (PG) 101 [Age: 29]
Equipment: Pure Brain Power
Experience: C++, C, Java 
Description: never comes out of his room

cars.txt (a list of people who own cars by their department/room #)

Programmer: PG 403, PG 101
Management: MANG 205

house.txt

Programmer: PG 504

The final csv should preferably tabulate to something like:

Name     | Division    | Division Abbrevation | Equipment | Room | Age | Car? | House? |
Scott Lee  Programming          PG                 PC        403   25     YES     NO 
Mary Smith Public Rel.          PR               Mac, PC     605   24      NO     NO

The ultimate goal is to have a database, where searching "PR" would return every row where a person's Department is "PR," etc. There's maybe 30 text files total, each representing one or more columns in a database. Some columns are short paragraphs, which include commas. Around 10,000 rows total. I know Python has built in csv, but I'm not sure where to start, and how to end with just 1 csv. Any help?

zhuyxn
  • 6,671
  • 9
  • 38
  • 44
  • 1
    Does it have to be CSV? Are you sure you don't want a real database? ("Real database" doesn't necessarily mean "clunky", "slow" or "hard to do". `sqlite3` is a python builtin and it's easy, fast and fun.) – Li-aung Yip Mar 27 '12 at 09:30
  • Having your data in a real database also makes queries of the type you want very easy to do. For example, `SELECT * FROM people WHERE department = HR` is a lot easier than the equivalent search through a .csv file. – Li-aung Yip Mar 27 '12 at 09:32

4 Answers4

1

It looks like you're looking for someone who will solve a whole problem for you. Here I am :)

General idea is to parse general info to dict (using regular expressions), then append additional fields to it and finally write to CSV. Here's Python 3.x solution (I think Python 2.7+ should suffice):

import csv
import re


def read_general(fname):
    # Read general info to dict with 'PR 123'-like keys

    # Gerexp that will split row into ready-to-use dict
    re_name = re.compile(r'''
        (?P<Name>.+)
        \ --\  # Separator + space
        (?P<Division>.+)
        \  # Space
        \(
            (?P<Division_Abbreviation>.*)
        \)
        \  # Space
        (?P<Id>\d+)
        \  # Space
        \[Age:\  # Space at the end
            (?P<Age>\d+)
        \]
        ''', re.X)

    general = {}

    with open(fname, 'rt') as f:
        for line in f:
            line = line.strip()
            m = re_name.match(line)

            if m:
                # Name line, start new man
                man = m.groupdict()
                key = '%s %s' % (m.group('Division_Abbreviation'), m.group('Id'))
                general[key] = man

            elif line:
                # Non empty lines
                # Add values to dict
                key, value = line.split(': ', 1)
                man[key] = value

    return general


def add_bool_criteria(fname, field, general):
    # Append a field with YES/NO value

    with open(fname, 'rt') as f:
        yes_keys = set()

        # Phase one, gather all keys
        for line in f:
            line = line.strip()
            _, keys = line.split(': ', 1)

            yes_keys.update(keys.split(', '))

        # Fill data
        for key, man in general.items():  # iteritems() will be faster in Python 2.x
            man[field] = 'YES' if key in yes_keys else 'NO'


def save_csv(fname, general):
    with open(fname, 'wt') as f:
        # Gather field names
        all_fields = set()
        for value in general.values():
            all_fields.update(value.keys())

        # Write to csv
        w = csv.DictWriter(f, all_fields)
        w.writeheader()
        w.writerows(general.values())


def main():
    general = read_general('general.txt')
    add_bool_criteria('cars.txt', 'Car?', general)
    add_bool_criteria('house.txt', 'House?', general)
    from pprint import pprint
    pprint(general)
    save_csv('result.csv', general)


if __name__ == '__main__':
    main()

I wish you lot of $$$ for this ;)

Side note

CSV is a history, you could use JSON for storage and further use, because it's simpler to use, more flexible and human readable.

Tupteq
  • 2,986
  • 1
  • 21
  • 30
  • My ultimate goal is entering the .csv file into a MySQL database using LOAD DATA INFILE, and I'm not really sure of any other way to do it easily. Any tips on that? – zhuyxn Mar 27 '12 at 21:23
  • You only need to save CSV in appropriate format. I'm not a MySQL expert, so I'm not sure how exactly it should look like, but Python allows you to customize output format (more details [here](http://docs.python.org/library/csv.html#dialects-and-formatting-parameters)). You'll also need to include only fields you want in output data (pass through data and remove unwanted fields). Finally you need to define correct table in MySQL and use __LOAD DATA INFILE__. I could create a complete solution for you, but it would take me other hour or so and unfortunately I don't have time for this right now. – Tupteq Mar 28 '12 at 10:02
0

You just have a function which parses one file, and returns a list of dictionaries containing {'name': 'Bob Simon', 'age': 29, ...} etc. Then call this on each of your files, extending a master list. Then write this master list of dicts as a CSV file.

More elaborately:

First you need to parse the input files, you'd have a function which takes a file, and returns a list of "things".

def parse_txt(fname):
    f = open(fname)

    people = []

    # Here, parse f. Maybe using a while loop, and calling
    # f.readline() until there is an empty line Construct a
    # dictionary from each person's block, and append it to allpeople

    return people

This returns something like:

people = [
    {'name': 'Bob Simon', 'age': 29},
    {'name': 'Susan Kim', 'age': 21},
]

Then, loop over each of your input files (maybe by using os.listdir, or optparse to get a list of args):

allpeople = []
for curfile in args:
     people = parse_txt(fname = curfile)
     allpeople.extend(people)

So allpeople is a long list of all the people from all files.

Finally you can write this to a CSV file using the csv module (this bit usually involves another function to reorganise the data into a format more compatible with the writer module)

dbr
  • 165,801
  • 69
  • 278
  • 343
0

I'll do it backwards, I'll start by loading all those house.txt and cars.txt each one into a dict, that could look like:

cars = {'MANG': [205], 'PG': [403, 101]}

Since you said to have like 30 of them, you could easily use a nested dict without making things too complicated:

data = {'house': {'PG': 504}, 'cars': {...}}

Once the data dict will be complete, load general.txt and while building the dict for each employee (or whatever they are) do a dict look-up see if they have a house or not, or a car, etc..

For example for John Doe you'll have to check:

if data['house']['PG'].get(205):
    # ...

and update his dict accordingly. Obviously you don't have to hard code all the possible look-ups, just build a couple of lists of the ['house', 'cars', ...] or something like that and iterate over it.

At the end you should have a big list of dict with all the info merged, so just write each one of them to a csv file.

Rik Poggi
  • 28,332
  • 6
  • 65
  • 82
0

Best possible advise: Don't do that.

Your cars and house relations are, ummmm, interesting. Owning a house or a car is an attribute of a person or other entity (company, partnership, joint tenancy, tenancy in common, etc, etc). It is NOT an attribute of a ("division", room) combination. The first fact in your cars file is "A programmer in room 403 owns a car". What happens in the not unlikely event that there 2 or more programmers in the same room?

The equipment shouldn't be in a list.

Don't record age, record date or year of birth.

You need multiple tables in a database, not 1 CSV file. You need to study a book on elementary database design.

John Machin
  • 81,303
  • 11
  • 141
  • 189
  • Thanks, it's true I'm not an expert with databases, and I have a lot more to learn. The example I gave was purely an example. The real data is much more trivial than this, and there, the DEPT/room number links directly to a single person. I just thought it would be better to give something more relatable. – zhuyxn Mar 27 '12 at 21:22
  • I'm not sure what you mean by "trivial". Given realistic examples, not nonsensical ones. – John Machin Mar 27 '12 at 22:01