Can I group files by Date and ID and do diff on them?

Question

I have a bunch of files in a directory 698 to be exact. Each file contains a date and a unique ID, as well as a name. Like so:

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import csv
import os
import re

20151231_7801_Test_Maps.txt
20151231_7801_Test_Items.txt
20151231_7802_Test_Maps.txt
20151231_7802_Test_Items.txt

I'm looking to group them by date and identifier, open each of the files (Maps, and Items), and do a diff analysis on some ids in the file. How would I do this?

So far I have this as my code, but I don't know how to iterate through and open each file per group:

groups = defaultdict(list)
for filename in os.listdir('F:\Desktop'):
    date = filename[:8]
    identifier = filename[10:14]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)

My output prints some of the groups correctly but not all, for example:

('20151231','7801')['20151231_7801_Test_Maps.txt, 20151231_7801_Test_Items.txt]

Some groups only print one file even though there are two files for that date and identifier.

This is not my main concern but once they are broken up in groups I'd like to assign each file in the group to a dataframe like so :

for key in groups:
    maps = pd.read_csv(file1, sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')
    items = pd.read_csv(file2, sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')

    #checks IDs between the two files and looks for differences
    set(maps.ID).difference(items.ID)

Can someone please help with grouping the files by date and ID, and iterating opening the files by group? Thanks!

score 1 · Answer 1 · edited May 23 '17 at 12:24

took some help from https://stackoverflow.com/a/20228113/6626530 and did it

   import pandas as pd


from collections import defaultdict

difference = pd.DataFrame(columns=('Filename1', 'Filename2', 'DiffID1','DiffID2'))

pathloc ='C:\Users\shmathew\Desktop\Sample\\abc\\'
groups = defaultdict(list)
for filename in os.listdir(pathloc):
    date = filename[:8]
    identifier = filename[10:14]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)



for key,filenames  in groups.iteritems():
    #print " processing following files"
    #print filenames
    maps = pd.read_csv(pathloc+filenames[1], sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')
    Items = pd.read_csv(pathloc+filenames[0]  , sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')
    df = pd.concat([maps, Items])
    df = df.reset_index(drop=True)
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]




    #print "\n\n Difference \n\n"
    ids= (df.reindex(idx))
    row =list(filenames);
    row.extend(list(ids['ID']))

    print  row 
   # difference.append(row)
    difference.append(row)
print difference

output

['20151231_7802_Test_Items.txt', '20151231_7802_Test_Maps.txt', '00432931830TRNY1    ', '00432xx0TRNY1    ']
['20151231_7801_Test_Items.txt', '20151231_7801_Test_Maps.txt']
Empty DataFrame
Columns: [Filename1, Filename2, DiffID1, DiffID2]
Index: []

Thanks ! This works great, I'm wondering if there is a way to get this into a single dataframe column called Difference with the filename/ID next to each record? (Would be easier for reporting purposes to filter) — staten12, Jan 17 '17 at 19:47

staten12 · Accepted Answer · 2017-01-20T13:55:25.357

Taking from Shijo's answer, I've found a pretty good way to do this.

groups = defaultdict(list)
output = []

for filename in os.listdir(pathloc):
date = filename[:8]
identifier = filename[14:18]
basename, extension = os.path.splitext(filename)
groups[date, identifier].append(filename)


for key, fnames in groups.iteritems():
filedicts = {}
print list(fnames)
maps = pd.read_csv(pathloc+fnames[1], sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')
items = pd.read_csv(pathloc+fnames[0], sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')



diffs = set(maps.ID).symmetric_difference(items.ID)

filedicts['FileIDKey'] = list(key)
filedicts['Missing_IDs'] = list(diffs)                         
filedicts['FileNames'] = fnames

output.append(filedicts)

This allows me to then go and add this master dictionary list to a dataframe:

new = pd.DataFrame(output)

Can I group files by Date and ID and do diff on them?

2 Answers2