0

I have a bunch of files in a directory 698 to be exact. Each file contains a date and a unique ID, as well as a name. Like so:

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import csv
import os
import re

20151231_7801_Test_Maps.txt
20151231_7801_Test_Items.txt
20151231_7802_Test_Maps.txt
20151231_7802_Test_Items.txt

I'm looking to group them by date and identifier, open each of the files (Maps, and Items), and do a diff analysis on some ids in the file. How would I do this?

So far I have this as my code, but I don't know how to iterate through and open each file per group:

groups = defaultdict(list)
for filename in os.listdir('F:\Desktop'):
    date = filename[:8]
    identifier = filename[10:14]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)

My output prints some of the groups correctly but not all, for example:

('20151231','7801')['20151231_7801_Test_Maps.txt, 20151231_7801_Test_Items.txt]

Some groups only print one file even though there are two files for that date and identifier.

This is not my main concern but once they are broken up in groups I'd like to assign each file in the group to a dataframe like so :

for key in groups:
    maps = pd.read_csv(file1, sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')
    items = pd.read_csv(file2, sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')

    #checks IDs between the two files and looks for differences
    set(maps.ID).difference(items.ID)

Can someone please help with grouping the files by date and ID, and iterating opening the files by group? Thanks!

staten12
  • 735
  • 3
  • 9
  • 20

2 Answers2

1

took some help from https://stackoverflow.com/a/20228113/6626530 and did it

   import pandas as pd


from collections import defaultdict

difference = pd.DataFrame(columns=('Filename1', 'Filename2', 'DiffID1','DiffID2'))

pathloc ='C:\Users\shmathew\Desktop\Sample\\abc\\'
groups = defaultdict(list)
for filename in os.listdir(pathloc):
    date = filename[:8]
    identifier = filename[10:14]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)



for key,filenames  in groups.iteritems():
    #print " processing following files"
    #print filenames
    maps = pd.read_csv(pathloc+filenames[1], sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')
    Items = pd.read_csv(pathloc+filenames[0]  , sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')
    df = pd.concat([maps, Items])
    df = df.reset_index(drop=True)
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]




    #print "\n\n Difference \n\n"
    ids= (df.reindex(idx))
    row =list(filenames);
    row.extend(list(ids['ID']))

    print  row 
   # difference.append(row)
    difference.append(row)
print difference

output

['20151231_7802_Test_Items.txt', '20151231_7802_Test_Maps.txt', '00432931830TRNY1    ', '00432xx0TRNY1    ']
['20151231_7801_Test_Items.txt', '20151231_7801_Test_Maps.txt']
Empty DataFrame
Columns: [Filename1, Filename2, DiffID1, DiffID2]
Index: []
Community
  • 1
  • 1
Shijo
  • 9,313
  • 3
  • 19
  • 31
  • Thanks ! This works great, I'm wondering if there is a way to get this into a single dataframe column called Difference with the filename/ID next to each record? (Would be easier for reporting purposes to filter) – staten12 Jan 17 '17 at 19:47
  • updated the code, but couldnt make them into Dataframe – Shijo Jan 17 '17 at 21:26
0

Taking from Shijo's answer, I've found a pretty good way to do this.

groups = defaultdict(list)
output = []

for filename in os.listdir(pathloc):
date = filename[:8]
identifier = filename[14:18]
basename, extension = os.path.splitext(filename)
groups[date, identifier].append(filename)


for key, fnames in groups.iteritems():
filedicts = {}
print list(fnames)
maps = pd.read_csv(pathloc+fnames[1], sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')
items = pd.read_csv(pathloc+fnames[0], sep = '\t', usecols = ['ID'], skipfooter = 0, engine = 'python')



diffs = set(maps.ID).symmetric_difference(items.ID)

filedicts['FileIDKey'] = list(key)
filedicts['Missing_IDs'] = list(diffs)                         
filedicts['FileNames'] = fnames

output.append(filedicts)

This allows me to then go and add this master dictionary list to a dataframe:

new = pd.DataFrame(output)
staten12
  • 735
  • 3
  • 9
  • 20