I am using Python 3.4, Pandas 0.16.2 and Jupyter Notebook as my IDE.
I am importing the following two dataframes as follows:
Lookup = pd.read_excel("LookupMergeOutput.xlsx")
Concatenated = pd.read_csv('Concatenated.csv', error_bad_lines = False, na_values='', iterator=True, chunksize=1000)
data = pd.concat([chunk for chunk in Concatenated], ignore_index=True)
Data
Lookup dataframe spec: 23353 rows 8; 8 columns
Columns: Dtype:
LOGINNAME object
FIRSTNAME object
LASTNAME object
ID_y int64
CREATEUSERID int64
REVISIONUSERID int64
BEGINDATE object
ENDDATE object
data dataframe spec: 23653 rows; 667 columns
The column of interest for the merge in this dataframe is REFERENCE_ID
with
data.REFERENCE_ID.dtype
as 'object'
The columns to merge on are ID_y
and REFERENCE_ID
(in the data dataframe). The columns in Lookup contain unique value counts (eg. 265926639 etc).
The data dataframe has unique values, but also repeating instances of an an entry (for example, three instances of 265946567. Moreover, there are error strings with the following caricature: 421D6158-22D4-EDAC-0DEA-33B1FB5CC7AF
.
Goal:
To merge Lookup and data on the to key columns ID_y
and REFERENCE_ID
, respectively.
***References and prior search for answers but with little fruit to bear:
- Left merge without reindexing
- Conditional Merge
- Chris Albon's Tutorial
- Merging multiple dataframes
- Data Carpentry's Review
What I have attempted so far:
merged_left = pd.merge(left=Lookup,right=data, how='left', left_on='ID_y', right_on='REFERENCE_ID')
This returns all of the Lookup data but all 667 columns of the data dataframe come back as null values.
In theory, if there are common numbers in the two "key" columns, the data from the data dataframe to match. My hypothesis is either that there is a data type conflict or because there are repeating numbers in the data dataframe, that the merge is not taking place. I want to avoid dropping the duplicates or multiple instances of keys in the data column as this may have valid data.
I have also tried to rename the ID_y
columnname to REFERENCE_ID
and perform a merge as follows:
Lookup=Lookup.rename(columns = {'ID_y':'REFERENCE_ID'})
This produces a null
dataframe (just columns, but no values). I have also attempted to set the index as ID_y
and REFERENCE_ID
and then perform the merge on the index:
Lookup = Lookup.set_index('ID_y')
data = data.set_index('REFERENCE_ID')
merged_on_index = pd.merge(Lookup,data, left_index=True,right_index=True)
I receive the following traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-35-3909fd759082> in <module>()
----> 1 pd.merge(AgentLookup,data,left_index=True,right_index=True)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_index=right_index, sort=sort, suffixes=suffixes,
37 copy=copy)
---> 38 return op.get_result()
39 if __debug__:
40 merge.__doc__ = _merge_doc % '\nleft : DataFrame'
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
184
185 def get_result(self):
--> 186 join_index, left_indexer, right_indexer = self._get_join_info()
187
188 ldata, rdata = self.left._data, self.right._data
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in _get_join_info(self)
257 if self.left_index and self.right_index:
258 join_index, left_indexer, right_indexer = \
--> 259 left_ax.join(right_ax, how=self.how, return_indexers=True)
260 elif self.right_index and self.how == 'left':
261 join_index, left_indexer, right_indexer = \
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\index.py in join(self, other, how, level, return_indexers)
2041 other = other.astype('O')
2042 return this.join(other, how=how,
-> 2043 return_indexers=return_indexers)
2044
2045 _validate_join_method(how)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\index.py in join(self, other, how, level, return_indexers)
2054 else:
2055 return self._join_non_unique(other, how=how,
-> 2056 return_indexers=return_indexers)
2057 elif self.is_monotonic and other.is_monotonic:
2058 try:
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\index.py in _join_non_unique(self, other, how, return_indexers)
2125
2126 left_idx, right_idx = _get_join_indexers([self.values], [other.values],
-> 2127 how=how, sort=True)
2128
2129 left_idx = com._ensure_platform_int(left_idx)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in _get_join_indexers(left_keys, right_keys, sort, how)
459
460 # get left & right join labels and num. of levels at each location
--> 461 llab, rlab, shape = map(list, zip( * map(fkeys, left_keys, right_keys)))
462
463 # get flat i8 keys from label lists
TypeError: type object argument after * must be a sequence, not map
At this point in time, I at a loss as to what would be the best strategy. I welcome feedback on next steps.