0

Trying to write a script that exclude only rows from csv files under a specific directory, that is present in another csv file, and redirect the output to another csv. This something like an exception rule to apply.

Like from below input with considering the exception as below:

inDirectory/input.csv:

  Id    Name    Location    Data   Services  Action
  10    John    IN          1234   mail      active
  12    Samy    GR          5678   phone     disable
  28    Doug    UK          9123   phone     active

excDirectory/exception.csv:

  12    Samy    GR          5678   phone     disable

Wanted to redirect output as below:

outDirectory/output.csv:

  Id    Name    Location    Data   Services  Action
  10    John    IN          1234   mail      active
  28    Doug    UK          9123   phone     active

All i am able to write as below, which is incomplete and i am looking for a solution that perform the same. Any idea? i am very much new to Python scripting.

import pandas as pd

inDir = os.listdir('csv_out_tmp')
excFile = pd.read_csv('exclude/exception.csv', sep=',', index_col=0)
for csv in inDir:
  inFile = pd.read_csv('csv_out_tmp/' + csv)
  diff = set(inFile)^set(excFile)
  df[diff].to_csv('csv_out/' + csv, index=False)

Another way code i am writing as per @neotrinity

inDir = os.listdir('csv_out_tmp')
excFile = 'exclude/exception.csv'
for csv in inDir:
  inFile = open('csv_out_tmp/' + csv)
  excRow = set(open(excFile))
  with open('csv_out/' + csv, 'w') as f:
    for row in open(inFile):
      if row not in excRow:
        f.write(row)

With the above code the error i am getting as below

for row in open(inFile):
  TypeError: coercing to Unicode: need string or buffer, file found
Rio
  • 595
  • 1
  • 6
  • 27
  • And what should the exception be? – Milan Velebit Sep 13 '18 at 08:52
  • the exception would be if the same row or line present in the exception.csv. So the output csv can contain only rows or lines that is not present in the exception.csv. – Rio Sep 13 '18 at 08:56
  • 1
    Possible duplicate of : https://stackoverflow.com/questions/36891977/pandas-diff-of-two-dataframes – Max Sep 13 '18 at 09:08
  • I have just updated the script section as per the post @Max shared. Please let me know if that is the appropriate way of writing or i need to modify. – Rio Sep 13 '18 at 09:27
  • Post shared : try `set(InDir)^set(excFile)` (for those that don't know what I shared as I deleted it before [never used pandas before so didn't think it'd be helpful]) – Max Sep 13 '18 at 09:31
  • WIth the above script updated, now i am getting `TypeError: unhashable type: 'set'` – Rio Sep 13 '18 at 10:03
  • @SamironMallick you are trying to open the file twice. Replace `for row in open(inFile):` with `for row in inFile:`. This is line number 7 in your updated snippet. – neotrinity Sep 14 '18 at 15:50
  • @neotrinity thanks now it is executed but seems not dropping rows that is common between files. – Rio Sep 14 '18 at 15:59

1 Answers1

-1

To be honest you do not need pandas for this.

in_file = 'in.csv'
out_file = 'out.csv'
exception_file = 'exp.csv'
exception_rows = set(open(exception_file))
with open('out.csv', 'w') as f:
    for row in open('in.csv'):
        if row not in exception_rows:
            f.write(row)

this uses the least resources. only load the exception file in memory.

iterate over the input file row by row and write into the output file (row by row) consuming the least memory.

after that you can use pandas for your data analytics.

neotrinity
  • 230
  • 1
  • 5
  • its just a snippet. Not writing prod code ¯\_(ツ)_/¯. The outfile is "properly" closed via the context manager. The same logic applies. – neotrinity Sep 13 '18 at 11:51
  • It's not "just a snippet", it's an answer in a technical knowledge database that will be read by people of all levels of experience - including complete newbies - and as such should show good practices. – bruno desthuilliers Sep 13 '18 at 11:56
  • well that should have been the first comment. Not a condescending message followed by a down vote. I would have gladly updated the snippet. – neotrinity Sep 13 '18 at 12:03
  • @neotrinity While doing so with the below modification, getting `TypeError: coercing to Unicode: need string or buffer, set found.` Here is the code `inDir = os.listdir('csv_out_tmp') excFile = 'exclude/exception.csv' for csv in inDir: inFile = set(open('csv_out_tmp/' + csv)) excRow = set(open(excFile)) with open('csv_out/' + csv, 'w') as f: for row in open(inFile): if row not in excRow: f.write(row)` – Rio Sep 14 '18 at 10:27
  • this line `inFile = set(open('csv_out_tmp/' + csv))` is incorrect. Should just be `inFile = open('csv_out_tmp/' + csv)`. However as the other commenter has mentioned. Its better open the files via a context manager. – neotrinity Sep 14 '18 at 10:29
  • Thank you @neotrinity , while following above advice getting `for row in open(inFile): TypeError: coercing to Unicode: need string or buffer, file found` . I have just updated the code in the question section. – Rio Sep 14 '18 at 13:05