using pandas to compare large CSV files with different numbers of columns

Question

I am new to Python programming and I am trying to join two CSV files with different numbers of columns. The aim is to find missing records and create a report with specific columns from the master file.

An example of the two CSV files, copied directly from Excel:

SAMPLE CSV 1 (combine201709.csv)

start_time  end_time    aitechid    hh_village  grpdetails1/farmername  grpdetails1/farmermobile
2016-11-26T14:01:47.329+03  2016-11-26T14:29:05.042+03  AI00001 2447    KahsuGebru  919115604
2016-11-26T19:34:42.159+03  2016-11-26T20:39:27.430+03  936891238   2473    Moto Aleka  914370833
2016-11-26T12:13:23.094+03  2016-11-26T14:25:19.178+03  914127382   2390    Hagos   914039654
2016-11-30T14:31:28.223+03  2016-11-30T14:56:33.144+03  920784222

SAMPLE CSV 2 (combinedmissingrecords.csv)

farmermobile
941807851
946741296
9
920212218
915
939555303
961579437
919961811
100004123
972635273
918166831
961579437
922882638
100006273
919728710
30000739
920770648
100004727
963767487
915855665
932255143
923531603
0
931875236
918027506
8
916353266
918020303
924359729
934623027
916585963
960791618
988047183
100002632
300007241
918271897
300007238
918250712

I tried this, but was unable to get the expected output:

    import pandas as pd

    normalize = lambda x: "%.4f" % float(x) # round
    df = pd.read_csv("/media/dmogaka/DATA/week progress/week4/combine201709.csv", index_col=(0,1), usecols=(1, 2, 3,4),
                     header=None, converters=dict.fromkeys([1,2]))
    df2 = pd.read_csv("/media/dmogaka/DATA/week progress/week4/combinedmissingrecords.csv", index_col=(0,1), usecols=(0),
                      header=None, converters=dict.fromkeys([1,2]))
    result = df2.merge(df[['aitechid','grpdetails1/farmermobile','grpdetails1/farmername']],
             left_on='farmermobile', right_on='grpdetails1/farmermobile')
    result.to_csv("/media/dmogaka/DATA/week progress/week4/output.csv", header=None) # write as csv

Error message:

/usr/bin/python3.5 "/media/dmogaka/DATA/Panda tut/test/test.py"
Traceback (most recent call last):
  File "/media/dmogaka/DATA/Panda tut/test/test.py", line 7, in <module>
    header=None, converters=dict.fromkeys([1,2]))
  File "/home/dmogaka/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/dmogaka/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 405, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/dmogaka/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 764, in __init__
    self._make_engine(self.engine)
  File "/home/dmogaka/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 985, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/dmogaka/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1605, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 461, in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4968)
TypeError: 'int' object is not iterable

Process finished with exit code 1

Possible duplicate of [Comparing two pandas dataframes for differences](https://stackoverflow.com/questions/19917545/comparing-two-pandas-dataframes-for-differences) — MrE, Sep 16 '17 at 20:10
@MrE, I don't think it's a duplicate. If we have different # of columns `assert_frame_equal` will always be returning `AssertionError` — MaxU - stand with Ukraine, Sep 16 '17 at 20:16
Can you post two small (3-5 rows) sample reproducible data sets and your desired resulting data set? Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. — MaxU - stand with Ukraine, Sep 16 '17 at 20:17
Just guessing: if you want to compare dataframes, they should have the same format. So the first step is to trim/adjust the format to get comparable DFs, then compare as per the other post — MrE, Sep 16 '17 at 20:19
@MrE, imagine that we want to see which rows are missing in first DF that are present in the second one... — MaxU - stand with Ukraine, Sep 16 '17 at 20:22
@MaxU I have updated the data from the two CSV files. I really need a solution that works — Mirieri Mogaka, Sep 16 '17 at 21:41
@MaxU my desired data set is farmername, aitechid and farmermobile (primary key) — Mirieri Mogaka, Sep 16 '17 at 21:55
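
The idea raised in the comments (find the rows of one frame that have no match in the other, and report `farmername`, `aitechid` and `farmermobile` for the rest) can be sketched with a left merge and `indicator=True`. This is only an illustration under assumptions: the small frames below stand in for the two CSV files, and the number 914039654 is added to the lookup list purely so that the example contains one match.

```python
import pandas as pd

# Stand-in for combine201709.csv, using values from SAMPLE CSV 1.
master = pd.DataFrame({
    "aitechid": ["AI00001", "936891238", "914127382"],
    "grpdetails1/farmername": ["KahsuGebru", "Moto Aleka", "Hagos"],
    "grpdetails1/farmermobile": [919115604, 914370833, 914039654],
})

# Stand-in for combinedmissingrecords.csv.
# Assumption: 914039654 is included only to produce one matching row.
lookup = pd.DataFrame({"farmermobile": [941807851, 914039654, 946741296]})

# A left merge keeps every lookup row; the "_merge" column says where it matched.
merged = lookup.merge(
    master,
    how="left",
    left_on="farmermobile",
    right_on="grpdetails1/farmermobile",
    indicator=True,
)

# Numbers with no record in the master frame.
missing = merged.loc[merged["_merge"] == "left_only", ["farmermobile"]]

# Numbers that were found, with the columns wanted for the report.
found = merged.loc[
    merged["_merge"] == "both",
    ["grpdetails1/farmername", "aitechid", "farmermobile"],
]

print(missing)
print(found)
```

With real files, `master` and `lookup` would come from `pd.read_csv`, and either frame could be written out with `to_csv`.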

score 1 · Accepted Answer · answered Sep 16 '17 at 21:58

Try this (here `d1` is the DataFrame read from combine201709.csv and `d2` is the one read from combinedmissingrecords.csv):

    d2.merge(d1[['aitechid','grpdetails1/farmermobile','grpdetails1/farmername']],
             left_on='farmermobile', right_on='grpdetails1/farmermobile')

or

    d2.merge(d1[['aitechid','grpdetails1/farmermobile','grpdetails1/farmername']]
             .rename(columns={'grpdetails1/farmermobile':'farmermobile'}))
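
For reference, here is a self-contained sketch of both variants that runs without the original files. The CSV text is an in-memory stand-in built from the sample rows in the question (comma-separated, with a header row, which is an assumption about the real files), and 919115604 is added to the second file only so that the merge returns a row.

```python
import io

import pandas as pd

# In-memory stand-ins for the two files (assumed comma-separated with headers).
csv1 = io.StringIO(
    "aitechid,hh_village,grpdetails1/farmername,grpdetails1/farmermobile\n"
    "AI00001,2447,KahsuGebru,919115604\n"
    "936891238,2473,Moto Aleka,914370833\n"
    "914127382,2390,Hagos,914039654\n"
)
# Assumption: 919115604 is included only so that one row matches.
csv2 = io.StringIO("farmermobile\n919115604\n941807851\n")

d1 = pd.read_csv(csv1)
d2 = pd.read_csv(csv2)

cols = ["aitechid", "grpdetails1/farmermobile", "grpdetails1/farmername"]

# Variant 1: name the key column on each side explicitly.
by_keys = d2.merge(
    d1[cols], left_on="farmermobile", right_on="grpdetails1/farmermobile"
)

# Variant 2: rename the key so both frames share the column name;
# merge then joins on the common column by default.
by_rename = d2.merge(
    d1[cols].rename(columns={"grpdetails1/farmermobile": "farmermobile"})
)

print(by_keys)
print(by_rename)
```

The first variant keeps both key columns in the result, while the second leaves a single `farmermobile` column, which may be more convenient for a report.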

MaxU - stand with Ukraine

I have tried your code, but I keep getting the error message above @MaxU – Mirieri Mogaka Sep 16 '17 at 22:31
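
The traceback in the question ends inside `read_csv`, so the merge is never reached. A likely cause, assuming the script matches the code shown, is `usecols=(0)`: parentheses without a trailing comma do not create a tuple, so pandas receives the integer `0` and fails when it tries to iterate over it. Separately, `header=None` tells pandas that the file has no header row, so the column names used in the merge would not exist afterwards. A minimal sketch, with in-memory text standing in for combinedmissingrecords.csv:

```python
import io

import pandas as pd

# Parentheses alone do not make a tuple; the trailing comma does.
not_a_tuple = (0)
one_tuple = (0,)
print(type(not_a_tuple), type(one_tuple))

# Stand-in for the second file (assumed to have a header row).
text = "farmermobile\n941807851\n946741296\n"

# Passing a list to usecols avoids the int-vs-tuple pitfall, and leaving
# header at its default keeps "farmermobile" as the column name.
df2 = pd.read_csv(io.StringIO(text), usecols=[0])
print(df2.columns.tolist())

# With header=None the header line is read as data and the column is
# labelled 0, so a merge on 'farmermobile' could not find it.
no_header = pd.read_csv(io.StringIO(text), header=None)
print(no_header.columns.tolist(), no_header.iloc[0, 0])
```

The same reasoning applies to the first `read_csv` call: dropping `header=None` and the `converters` argument, and selecting columns by name, should leave the frames in the shape the merge expects.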
