
I have a program that reads a .csv file, checks each row for a mismatch in column count (by comparing it to the header fields), and collects everything it finds in a list (which it then writes to a file). What I want to do with this list is print the results as follows:

row numbers where the same mismatch is found : the amount of columns in that row

e.g.

rows: n-m : y

where n and m are the first and last row numbers of a run of rows that share the same column count y, and y does not match the header.

I have looked into these topics, and while the information is useful, they do not answer the question:

Find and list duplicates in a list?

Identify duplicate values in a list in Python

This is where I am right now:

import csv

# assumed: `data` is the opened .csv file, `file` is the report file opened for writing
r = csv.reader(data, delimiter='\t')
columns = []
for row in r:
    # adds column length to a list
    columns.append(len(row))

for a in range(len(columns)):
    # checks if the current member matches the header length of columns
    if columns[a] != columns[0]:
        # if it doesn't, write the row and the amount of columns in that row to a file
        file.write("row  " + str(a + 1) + ": " + str(columns[a]) + " \n")

The file output looks like this:

row  7220: 0 
row  7221: 0 
row  7222: 0 
row  7223: 0 
row  7224: 0 
row  7225: 1 
row  7226: 1 

whereas the desired end result is:

rows 7220 - 7224 : 0
rows 7225 - 7226 : 1

So what I essentially need, the way I see it, is a dictionary where the key is the range of rows that share a mismatch and the value is the number of columns in those rows. What I think I need (in horribly written pseudocode that doesn't make much sense now that I'm reading it years after writing this question) is this:

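def pseudoList(originalList):
    # Group the 1-based row numbers of consecutive equal values.
    if not originalList:
        return []
    ListOfLists = []
    duplicateList = [1]
    i = 1
    while i < len(originalList):
        if originalList[i] == originalList[i - 1]:
            duplicateList.append(i + 1)
        else:
            # the value changed: close the current group and start a new one
            ListOfLists.append(duplicateList)
            duplicateList = [i + 1]
        i += 1
    ListOfLists.append(duplicateList)
    return ListOfLists


def PseudocreateDict(ListOfLists, originalList):
    pseudoDict = {}
    for group in ListOfLists:
        a = group[0]    # this is the first row number in the group
        b = group[-1]   # this is the last row number in the group
        # key: the row range, value: the column count shared by those rows
        pseudoDict['{} - {}'.format(a, b)] = originalList[a - 1]
    return pseudoDict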

This, however, seems a very convoluted way of doing what I want, so I was wondering whether there is a) a more efficient way or b) an easier way to do this.

Christian W.

3 Answers


You can also try the following code -

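check = None  # column count of the mismatch run in progress (None: no open run)
start = None  # 1-based row number where that run began
# One extra iteration (a == len(columns)) flushes a run that reaches the last row.
for a in range(len(columns) + 1):
    current = columns[a] if a < len(columns) else None
    if check is not None and current == check:
        continue  # still inside the same run
    if check is not None:
        # The run ended on the previous row, whose 1-based number is a.
        if start != a:
            file.write("rows " + str(start) + " - " + str(a) + " : " + str(check) + "\n")
        else:
            file.write("row " + str(start) + " : " + str(check) + "\n")
        check = None
    # checks if the current member matches the header length of columns
    if current is not None and current != columns[0]:
        # if it doesn't, remember where the new run starts and its column count
        start = a + 1
        check = current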
Anand S Kumar

You can use a list comprehension to collect the elements of the columns list that differ from the element before them; their indices are the starting points of your ranges. Then enumerate these ranges and print/write out those whose value differs from the first (header) element. An extra element is appended to the list of ranges to mark the end index of the list, which avoids out-of-range indexing when looking up where the last range ends.

columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]

# each entry is [0-based index where a new run starts, column count of that run]
ranges = [[i+1, v] for i, v in enumerate(columns[1:]) if columns[i] != columns[i+1]]
ranges.append([len(columns), 0])  # special case for last element
for i, v in enumerate(ranges[:-1]):
    if v[1] != columns[0]:
        print("rows", v[0]+1, "-", ranges[i+1][0], ":", v[1])

output:

rows 2 - 5 : 1
rows 6 - 9 : 0
rows 10 - 11 : 1
rows 13 - 13 : 1
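
To make the indexing easier to follow, here is a small sketch that builds the same `ranges` list and then pairs each run start with the next one. The name `spans` is only an illustration and is not part of the code above.

```python
columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]

# Each entry is [0-based index where a new run starts, column count of that run].
ranges = [[i + 1, v] for i, v in enumerate(columns[1:]) if columns[i] != columns[i + 1]]
ranges.append([len(columns), 0])  # sentinel marking the end of the list

# Pair each run start with the following run start. The next start, as a
# 0-based index, equals the 1-based number of the last row in the current run.
spans = [(start + 1, nxt, count)
         for (start, count), (nxt, _) in zip(ranges, ranges[1:])
         if count != columns[0]]

print(spans)  # [(2, 5, 1), (6, 9, 0), (10, 11, 1), (13, 13, 1)]
```

Each tuple is (first row, last row, column count), which lines up with the printed output above.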
samgak

What you want to do is a map/reduce operation, but without the sorting that is normally done between the mapping and the reducing.
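
As a rough in-process sketch of that idea: `itertools.groupby` merges only adjacent items with equal keys, so it behaves like a reduce step with no sort in front of it. The function name `group_mismatches` and the sample list are made up for illustration; `columns` is assumed to hold one column count per row, with the header first.

```python
from itertools import groupby


def group_mismatches(columns):
    """Yield (first_row, last_row, count) for each run of mismatching rows.

    Row numbers are 1-based, as in the question's output.
    """
    if not columns:
        return
    header = columns[0]
    numbered = enumerate(columns, start=1)
    # groupby only merges *adjacent* items with equal keys, so no sort is needed.
    for count, run in groupby(numbered, key=lambda pair: pair[1]):
        rows = [row for row, _ in run]
        if count != header:
            yield rows[0], rows[-1], count


for first, last, count in group_mismatches([3, 3, 0, 0, 0, 1, 1, 3]):
    print("rows {} - {} : {}".format(first, last, count))
# rows 3 - 5 : 0
# rows 6 - 7 : 1
```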

If you output

row  7220: 0 
row  7221: 0 
row  7222: 0 
row  7223: 0 

to stdout, you can pipe this data to another Python program that generates the groups you want.

The second Python program could look something like this:

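import re
import sys

# assumes at least one input line of the form "row  <number>: <count>"
line = sys.stdin.readline()
first_rowid, last_diff = re.findall(r'(\d+)', line)
prev_rowid = first_rowid

for line in sys.stdin:
    rowid, diff = re.findall(r'(\d+)', line)
    if diff != last_diff:
        # the previous group ended on the row read just before this one
        print("rows", first_rowid, "-", prev_rowid, ":", last_diff)
        first_rowid, last_diff = rowid, diff
    prev_rowid = rowid

# flush the final group
print("rows", first_rowid, "-", prev_rowid, ":", last_diff)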

You would execute them like this in a Unix environment to write the output to a file:

python yourprogram.py | python myprogram.py > youroutputfile.dat

If you cannot run this in a Unix environment, you can still use this algorithm inside your own program with a few modifications.
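
One possible shape for those modifications, as a sketch only: the same "remember the previous value, emit when it changes" loop, applied to the question's in-memory `columns` list instead of stdin. The function name `mismatch_runs` is hypothetical, and the first element of `columns` is assumed to be the header's column count.

```python
def mismatch_runs(columns):
    """Return (first_row, last_row, count) for each run that differs from the header."""
    runs = []
    if not columns:
        return runs
    header = columns[0]
    first_row, last_count = 1, columns[0]
    for row, count in enumerate(columns[1:], start=2):
        if count != last_count:
            # The previous run ended on the row before this one.
            if last_count != header:
                runs.append((first_row, row - 1, last_count))
            first_row, last_count = row, count
    # Flush the final run, which the loop never closes.
    if last_count != header:
        runs.append((first_row, len(columns), last_count))
    return runs


for first, last, count in mismatch_runs([2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]):
    print("rows {} - {} : {}".format(first, last, count))
```

Returning the runs as tuples keeps the grouping separate from the formatting, so the same list can be printed or written to the report file.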

firelynx