0

I'm looking for some help cycling through each group in my textfile and matching three variables with my csv and on a successfull match it will write a few new variables to the csv file:

In the textfile line 1 matches to csv element 1 In the textfile line 2 matches to csv element 0

Each Student will be split into three parts: 3 Tommy 144512/23332 and part 1 and part 3 will be written to element 12 and 13 respectivley. Part 2 will be used for the third match, matching to csv elelment 8 this is to find out which row to write to.

"data" will be written to element 14 (column 15) "misc3" will be written to element 15 (column 16) "bla3" will be written to element 16 (column 17)

Commented Textfile:

     Textfile Item 1 (Will loop/cycle/run 4 times, because there are 4 students)
           |
           v

MData (N/A)                <-- Match Line 1 (matches to csv element 1)
DMATCH1                    <-- Match Line 2 (matches to csv element 0)
3 Tommy 144512/23332       <-- Match Line 3 (matches to csv element 8) (Loop 1)                 
1 Jim 90000/222311     <-- Match Line 3 (matches to csv element 8) (Loop 2)
1 Elz M 90000/222311       <-- Match Line 3 (matches to csv element 8) (Loop 3)
1 Ben 90000/222311         <-- Match Line 3 (matches to csv element 8) (Loop 4)
Data $50.90                <-- If "Data" Exists then filewrite to csv element 14 (Loop 1)   
misc2 $10.40               <-- If "misc2" Exists then filewrite to csv element 15 (Loop 1)
bla3 $20.20               <-- If "bla3" Exists then filewrite to csv element 16 (Loop 1)


     Textfile Item 2 (Will loop/cycle/run 2 times, because there are 3 students)
           |
           v

MData (B/B)                <-- Match Line 1 (matches to csv element 1)
DMATCH2                    <-- Match Line 2 (matches to csv element 0)
4 James Smith 2333/114441  <-- Match Line 3 (matches to csv element 8) (Loop 1)
4 Mike 90000/222311        <-- Match Line 3 (matches to csv element 8) (Loop 2)
4 Jessica Long 2333/114441 <-- Match Line 3 (matches to csv element 8) (Loop 3)
Data $50.90                <-- If "Data" Exists then filewrite to csv element 14 (Loop 1)   
bla3 $5.44                <-- If "bla3" Exists then filewrite to csv element 16 (Loop 1)


     Textfile Item 3 (Will loop/cycle/run 2 times, because there are 2 students)
           |
           v

Mdata                      <-- Match Line 1 (matches to csv element 1)
DMATCH3                    <-- Match Line 2 (matches to csv element 0)
5 Joe Reane 0/0            <-- Match Line 3 (matches to csv element 8) (Loop 1)
5 Peter Jones 90000/222311 <-- Match Line 3 (matches to csv element 8) (Loop 2)
misc2 $420.00              <-- If "misc2" Exists then filewrite to csv element 15 (Loop 1)
bla3 $210.00               <-- If "bla3" Exists then filewrite to csv element 16 (Loop 1)

Uncommented Real Textfile:

MData (N/A)
DMATCH1
3 Tommy 144512/23332
1 Jim 90000/222311
1 Elz M 90000/222311
1 Ben 90000/222311
Data $50.90
misc2 $10.40
bla3 $20.20


MData (B/B) 
DMATCH2
4 James Smith 2333/114441
4 Mike 90000/222311
4 Jessica Long 2333/114441
Data $50.90
bla3 $5.44


Mdata
DMATCH3
5 Joe Reane 0/0
5 Peter Jones 90000/222311
Data $10.91
misc2 $420.00
bla3 $210.00

CSV Before:

MATCH1,MATCH2,TITLE,TITLE,TITLE,TITLE,TITLE,TITLE,MATCH3,DATA,TITLE,TITLE
DMATCH1,MData (N/A),data,data,data,data,data,data,Tommy,55,data,data
DMATCH1,MData (N/A),data,data,data,data,data,data,Ben,54,data,data
DMATCH1,MData (N/A),data,data,data,data,data,data,Jim,52,data,data
DMATCH1,MData (N/A),data,data,data,data,data,data,Elz M,22,data,data
DMATCH2,MData (B/B),data,data,data,data,data,data,James Smith,15,data,data
DMATCH2,MData (B/B),data,data,data,data,data,data,Jessica Long,224,data,data
DMATCH2,MData (B/B),data,data,data,data,data,data,Mike,62,data,data
DMATCH3,Mdata,data,data,data,data,data,data,Joe Reane,66,data,data
DMATCH3,Mdata,data,data,data,data,data,data,Peter Jones,256,data,data
DMATCH3,Mdata,data,data,data,data,data,data,Lesley Lope,5226,data,data

CSV After:

MATCH1,MATCH2,TITLE,TITLE,TITLE,TITLE,TITLE,TITLE,MATCH3,DATA,TITLE,TITLE,,,,,
DMATCH1,MData (N/A),data,data,data,data,data,data,Tommy,55,data,data,3,144512/23332,Data $50.90,misc2 $10.40,bla3 $20.20
DMATCH1,MData (N/A),data,data,data,data,data,data,Ben,54,data,data,1,90000/222311,,,
DMATCH1,MData (N/A),data,data,data,data,data,data,Jim,52,data,data,1,90000/222311,,,
DMATCH1,MData (N/A),data,data,data,data,data,data,Elz M,22,data,data,1,90000/222311,,,
DMATCH2,MData (B/B),data,data,data,data,data,data,James Smith,15,data,data,4,2333/114441,Data $50.90,,bla3 $5.44
DMATCH2,MData (B/B),data,data,data,data,data,data,Jessica Long,224,data,data,4,2333/114441,,,
DMATCH2,MData (B/B),data,data,data,data,data,data,Mike,62,data,data,4,90000/222311,,,
DMATCH3,Mdata,data,data,data,data,data,data,Joe Reane,66,data,data,5,0/0,,misc2 $420.00,bla3 $210.00
DMATCH3,Mdata,data,data,data,data,data,data,Peter Jones,256,data,data,5,90000/222311,,,
DMATCH3,Mdata,data,data,data,data,data,data,Lesley Lope,5226,data,data,,,,,

Anyone know how to acheive this?

Any help would be much much appreciated!

Kimvais
  • 38,306
  • 16
  • 108
  • 142
Ryflex
  • 5,559
  • 25
  • 79
  • 148

1 Answers1

5

There are actually several sub-problems in this problem. First we have to read the interestingly-formatted text file:

Read matchers text file

# each block in the text file will be one element of this list
matchers = [[]]
i = 0 
with open('test.txt') as infile:
    for line in infile:
        line = line.strip()
        # Blocks are seperated by blank lines
        if len(line) == 0:
            i += 1
            matchers.append([])
            # assume there are always two blank lines between items 
            # and just skip to the lext line
            infile.next()
            continue
        matchers[i].append(line)

At this point we have a list of lists, one element for each block, containing one element for each line. Then we have to convert to something more table-like

Convert to table-like format

import re

# This regular expression matches the variable number of students in each block
studentlike = re.compile('(\d+) (.+) (\d+/\d+)')
# We will build a table containing a list of elements for each student
table = []
for matcher in matchers:
    # We use an iterator over the block lines to make indexing simpler
    it = iter(matcher)
    # The first two elements are match values
    m1, m2 = it.next(), it.next()
    # then there are a number of students
    students = []
    for possiblestudent in it:
        m = studentlike.match(possiblestudent)
        if m:
            students.append(list(m.groups()))
        else:
            break
    # After the students come the data elements, which we read into a dictionary
    # We also add in the last possible student line as that didn't match the student re
    dataitems = dict(item.split() for item in [possiblestudent] + list(it))
    datanames = dataitems.keys()
    # Finally we construct the table
    for student in students:
        # We use the dictionary .get() method to return blanks for the missing fields
        table.append([m1, m2] + student + [dataitems.get(d, '') for d in datanames])
print table

Join with pandas

Now, we can join up the data. I've used Pandas here, as it is just great for this kind of joining:

import pandas
csvdata = pandas.read_csv('test.csv')
textdata = pandas.DataFrame(table, columns=['MATCH2', 'MATCH1', 'TITLE01', 'MATCH3', 'TITLE02', 'Data', 'misc2', 'bla3'])
mergeddata = pandas.merge(csvdata, textdata, how='left', on=['MATCH1', 'MATCH2', 'MATCH3'], sort=False)
mergeddata.to_csv('output.csv', index=False)
Community
  • 1
  • 1
chthonicdaemon
  • 19,180
  • 2
  • 52
  • 66
  • Works amazingly, would it be possible to ask if you could comment that as there are a fair few parts that I don't understand how they are doing what they are doing... – Ryflex Oct 05 '13 at 01:25
  • I've added comments. Feel free to ask directly if there are other things you still don't get. – chthonicdaemon Oct 05 '13 at 04:30
  • Thanks for that, it works brilliantly I have however come across one problem where the strings with the dollars or other symbols in for example: `Data £1.00` will come out as: `Data £1.00` how can I bypass/get around this? – Ryflex Oct 08 '13 at 03:39
  • This is an encoding problem. You will have to find out what encoding you use. [This table](http://www.i18nqa.com/debug/utf8-debug.html) is useful for that. It appears as though your data is encoded in Windows 1252 but Python is reading it as UTF-8. Fix your encoding in your base files and the problems should go away. – chthonicdaemon Oct 08 '13 at 06:30
  • Any idea how to make it so it matches to the same column if I run it twice (to and from the same file) overwriting/updating any old values instead of writing new columns? – Ryflex Oct 11 '13 at 05:45
  • I suspected that you original question had been to vague about the handling of the column names. I've updated the code to just add the column names that were in the text file. Then you just match similar columns to one another - pandas does this quite easily. If this is a recurring thing you may want to invest some time into merging the three match columns into one ID column that you can use as an index in pandas. You should ask a new question about merging the data frames with overwriting rather than going on in this thread. – chthonicdaemon Oct 11 '13 at 07:18