2

I am using Python 3.4, Pandas 0.16.2 and Jupyter Notebook as my IDE.

I am importing the following two dataframes as follows:

Lookup = pd.read_excel("LookupMergeOutput.xlsx")

Concatenated = pd.read_csv('Concatenated.csv', error_bad_lines = False, na_values='', iterator=True,  chunksize=1000)
data = pd.concat([chunk for chunk in Concatenated], ignore_index=True)

Data

Lookup dataframe spec: 23353 rows 8; 8 columns

Columns:          Dtype:

LOGINNAME         object
FIRSTNAME         object
LASTNAME          object
ID_y               int64
CREATEUSERID       int64
REVISIONUSERID     int64
BEGINDATE         object
ENDDATE           object

data dataframe spec: 23653 rows; 667 columns

The column of interest for the merge in this dataframe is REFERENCE_ID with

data.REFERENCE_ID.dtype 

as 'object'

The columns to merge on are ID_y and REFERENCE_ID (in the data dataframe). The columns in Lookup contain unique value counts (eg. 265926639 etc).

The data dataframe has unique values, but also repeating instances of an an entry (for example, three instances of 265946567. Moreover, there are error strings with the following caricature: 421D6158-22D4-EDAC-0DEA-33B1FB5CC7AF.

Goal: To merge Lookup and data on the to key columns ID_y and REFERENCE_ID, respectively.

***References and prior search for answers but with little fruit to bear:

  1. Left merge without reindexing
  2. Conditional Merge
  3. Chris Albon's Tutorial
  4. Merging multiple dataframes
  5. Data Carpentry's Review

What I have attempted so far:

merged_left = pd.merge(left=Lookup,right=data, how='left', left_on='ID_y', right_on='REFERENCE_ID')

This returns all of the Lookup data but all 667 columns of the data dataframe come back as null values.

In theory, if there are common numbers in the two "key" columns, the data from the data dataframe to match. My hypothesis is either that there is a data type conflict or because there are repeating numbers in the data dataframe, that the merge is not taking place. I want to avoid dropping the duplicates or multiple instances of keys in the data column as this may have valid data.

I have also tried to rename the ID_ycolumnname to REFERENCE_ID and perform a merge as follows:

Lookup=Lookup.rename(columns = {'ID_y':'REFERENCE_ID'})

This produces a null dataframe (just columns, but no values). I have also attempted to set the index as ID_y and REFERENCE_ID and then perform the merge on the index:

Lookup = Lookup.set_index('ID_y')
data = data.set_index('REFERENCE_ID')

merged_on_index = pd.merge(Lookup,data, left_index=True,right_index=True)

I receive the following traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-35-3909fd759082> in <module>()
----> 1 pd.merge(AgentLookup,data,left_index=True,right_index=True)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
     36                          right_index=right_index, sort=sort, suffixes=suffixes,
     37                          copy=copy)
---> 38     return op.get_result()
     39 if __debug__:
     40     merge.__doc__ = _merge_doc % '\nleft : DataFrame'

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
    184 
    185     def get_result(self):
--> 186         join_index, left_indexer, right_indexer = self._get_join_info()
    187 
    188         ldata, rdata = self.left._data, self.right._data

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in _get_join_info(self)
    257         if self.left_index and self.right_index:
    258             join_index, left_indexer, right_indexer = \
--> 259                 left_ax.join(right_ax, how=self.how, return_indexers=True)
    260         elif self.right_index and self.how == 'left':
    261             join_index, left_indexer, right_indexer = \

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\index.py in join(self, other, how, level, return_indexers)
   2041             other = other.astype('O')
   2042             return this.join(other, how=how,
-> 2043                              return_indexers=return_indexers)
   2044 
   2045         _validate_join_method(how)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\index.py in join(self, other, how, level, return_indexers)
   2054             else:
   2055                 return self._join_non_unique(other, how=how,
-> 2056                                              return_indexers=return_indexers)
   2057         elif self.is_monotonic and other.is_monotonic:
   2058             try:

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\index.py in _join_non_unique(self, other, how, return_indexers)
   2125 
   2126         left_idx, right_idx = _get_join_indexers([self.values], [other.values],
-> 2127                                                  how=how, sort=True)
   2128 
   2129         left_idx = com._ensure_platform_int(left_idx)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in _get_join_indexers(left_keys, right_keys, sort, how)
    459 
    460     # get left & right join labels and num. of levels at each location
--> 461     llab, rlab, shape = map(list, zip( * map(fkeys, left_keys, right_keys)))
    462 
    463     # get flat i8 keys from label lists

TypeError: type object argument after * must be a sequence, not map

At this point in time, I at a loss as to what would be the best strategy. I welcome feedback on next steps.

Community
  • 1
  • 1
ahlusar1989
  • 349
  • 5
  • 16

1 Answers1

2

Both your join fields are different data types. REFERENCE_ID as an object type and ID_y as a numeric int64 type. Interestingly, the object dtype in pandas is a NumPy ndarray which holds pointers to variable-length string items. Also, object is the most general of the data types when imported data contains both strings and numbers.

Consider converting the join fields to same data types using astype. Of course, strings tend to be most permissive allowing various characters like hyphens.

String conversion:

data['REFERENCE_ID'] = data['REFERENCE_ID'].apply(str)  # NOTICE APPLY HERE
LookUp['ID_y'] = LookUp['ID_y'].astype(str)

Numeric conversion:

data['REFERENCE_ID'] = data['REFERENCE_ID'].astype(int)

Combination:

data['REFERENCE_ID'] = data['REFERENCE_ID'].astype(str).astype(int)
LookUp['ID_y'] = LookUp['ID_y'].astype(str).astype(int)
Community
  • 1
  • 1
Parfait
  • 104,375
  • 17
  • 94
  • 125