
I have a dataset that I would like to parse so that I can analyze it. I want to pull out specific columns, and then split the rows into those that come before and those that come after a block of non-uniform rows. Here is an example of what my data looks like. Note the three rows in the middle that do not match the format of the other rows:

1386865618963   1   M   subject_avatar  3.636229    1.000000    5.422941    30.200327   0.000000    0.000000
1386865618965   1   M   subject_avatar  3.631835    1.000000    5.415390    30.200327   0.000000    0.000000
1386865618966   2   M   subject_avatar  3.627432    1.000000    5.407826    30.200327   0.000000    0.000000
1386865618968   1   M   subject_avatar  3.625223    1.000000    5.404030    30.200327   0.000000    0.000000
1386865618970   1   M   subject_avatar  3.620788    1.000000    5.396411    30.200327   0.000000    0.000000
1386865618970   0   D   4345048336
1386865618970   0   D   4345763672
1386865618971   0   I   BOXGEOM (45.0, 0.0, -45.0, 19.0, 3.5, 19.0) {'callback': <bound method YCEnvironment.dropoff of <navigate.YCEnvironment instance at 0x103065440>>, 'cbargs': (0, {'width': 1.75, 'image': <pyepl.display.Image object at 0x102f9da90>, 'height': 4.75, 'volbitSize': (0.5, 0.71999999999999997), 'name': 'Julia'}, {'width': 0.69999999999999996, 'name': 'Flower Patch', 'realpos': (45.0, 0.0, -45.0), 'image': <pyepl.display.Image object at 0x102fc3f50>, 'realsize': (7.0, 3.5, 7.0), 'type': 'store', 'volbitSize': (0.5, 0.5), 'height': 0.34999999999999998}), 'permiable': True}  4926595152
1386865618972   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865618992   2   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865618996   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865618998   2   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865619002   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865619005   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865619008   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000

I previously asked a question (Parsing specific columns from a dataset in python) about parsing this data into columns. However, the columns only display the number of items in the column and not the items themselves.

I realize these are two different questions (separating into columns, separating before and after the non-uniform row), but any help with the parsing would be appreciated!

Julia
  • What do you mean by "separate"? Do you just want to remove the D & I rows, or do you want each cluster of Ms to be grouped somehow? – DSM Jan 06 '14 at 16:40
  • I want to remove the D and I rows and cluster the Ms to show Ms that occurred before the D and I rows and Ms that occurred after the D and I rows. – Julia Jan 06 '14 at 20:09

2 Answers

1

A straightforward idea:

You can preprocess the raw file to skip all irrelevant lines, for example:

with open('raw.txt', 'r') as infile:
    f = infile.readlines()
    with open('filtered.txt', 'w') as outfile:
        for line in f:
            if 'subject_avatar' in line: # or other better rules
                outfile.write(line)

Then process the clean data in filtered.txt using pandas or another tool.


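To also separate the rows that come before the omitted lines from those that come after:

with open('d.txt', 'r') as infile:
    lines = infile.readlines()

def is_marker(line):
    # Non-uniform rows have '0' in the second whitespace-separated column
    fields = line.split()
    return len(fields) > 1 and fields[1] == '0'

i = 0
with open('filtered_part1.txt', 'w') as outfile:
    # Copy rows until the first marker row (or the end of the file)
    while i < len(lines) and not is_marker(lines[i]):
        outfile.write(lines[i])
        i += 1
# Skip the run of marker rows
while i < len(lines) and is_marker(lines[i]):
    i += 1
with open('filtered_part2.txt', 'w') as outfile:
    # Everything after the marker rows goes to the second file
    outfile.writelines(lines[i:])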

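This separation is ugly but workable. It finds the rows with 0 in the second column, writes everything before them to one file, skips them, and writes everything after them to another file.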
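
Since the full file is about 20,000 lines and may contain more than one block of 0 rows, the same idea can be generalized to number every run of data rows. The following is only a sketch under assumptions: it treats any row whose second whitespace-separated field is 0 as a separator, and the names is_marker and split_segments and the shortened sample rows are made up for illustration.

```python
from itertools import groupby

def is_marker(line):
    """True for non-uniform rows: '0' in the second whitespace-separated field."""
    fields = line.split()
    return len(fields) > 1 and fields[1] == '0'

def split_segments(lines):
    """Return a list of segments, one per run of consecutive data rows
    found between runs of marker rows."""
    segments = []
    for marker, group in groupby(lines, key=is_marker):
        if not marker:
            segments.append(list(group))
    return segments

# Shortened, made-up sample in the same layout as the question's data.
sample = [
    "1386865618968\t1\tM\tsubject_avatar\t3.625223\n",
    "1386865618970\t1\tM\tsubject_avatar\t3.620788\n",
    "1386865618970\t0\tD\t4345048336\n",
    "1386865618971\t0\tI\tBOXGEOM (45.0, 0.0, -45.0)\n",
    "1386865618972\t1\tM\tsubject_avatar\t3.621182\n",
]

segments = split_segments(sample)
for number, segment in enumerate(segments):
    for row in segment:
        # Prefix each row with the index of the segment it belongs to
        print(number, row.rstrip("\n"))
```

With a single block of 0 rows, segment 0 holds the rows before it and segment 1 the rows after it, so the segment index can serve as the before/after label on each row.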

Ray
  • Thanks, this worked well! Now, do you know how I can distinguish between the data that came before and the data that came after the omitted lines? – Julia Jan 06 '14 at 17:55
  • @Julia Glad it worked. Do you have only one specific data file like this or the one above is only an illustration? – Ray Jan 07 '14 at 02:55
  • @Julia One way I can think of is to check the raw file line by line the second column or third (specific index of the string). Once you encounter those lines to omit, you know it is the end of first part and beginning of second. – Ray Jan 07 '14 at 03:00
  • This is just a small piece of one of the files of the data, it is 20,000 lines. Do you know how I can denote on the row that it is either before or after, possibly using the 0's in the second column? – Julia Jan 07 '14 at 13:51
  • @Julia See updated answer. Just as straight forward as it is. – Ray Jan 07 '14 at 14:59
0

If you would like to omit the non-uniform lines, you can simply check the length of each row:

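rows = []
for line in lines:  # lines: the raw file's lines, e.g. from infile.readlines()
    row = line.split()
    # Uniform rows have exactly 10 whitespace-separated fields
    if len(row) == 10:
        rows.append(row)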
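
Once rows holds only the uniform lines, specific columns can be pulled out by index. The sketch below is one possible way to do it, not a definitive one: select_columns is a hypothetical helper, the column indices 4 and 6 are picked arbitrarily, and the sample lines are copied from the question.

```python
def select_columns(rows, indices, convert=float):
    """Return one list per requested column index, converting each value."""
    return [[convert(row[i]) for row in rows] for i in indices]

sample_lines = [
    "1386865618963   1   M   subject_avatar  3.636229    1.000000    5.422941    30.200327   0.000000    0.000000",
    "1386865618970   0   D   4345048336",
    "1386865618972   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000",
]

# Keep only uniform rows (10 fields), as in the length check above
rows = [line.split() for line in sample_lines if len(line.split()) == 10]

timestamps = [int(row[0]) for row in rows]   # first column as integers
col4, col6 = select_columns(rows, [4, 6])    # two numeric columns as floats

# Printing the lists themselves shows the items rather than their count
print(timestamps)
print(col4)
print(col6)
```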
Norbert Sebők