I need to match an EXACT substring in a pandas text column. However, when that dataframe text column has duplicate entries I get: ValueError: cannot reindex from a duplicate axis.
I reviewed the following post to determine how to query rows, but it was mostly about matching a whole entry rather than a substring. Select rows from a DataFrame based on values in a column in pandas
The following post showed how to find a substring using a regex pattern, which is EXACTLY what I need to look for regex word boundaries and what I am using below. How to filter rows containing a string pattern from a Pandas dataframe
I was able to get the code from the 2nd SO post above working except where I have duplicates in my Comments column. Note, entries 600 and 700 in the debug.txt file below are dupes which is fine. These dupes are expected, so how do I accommodate them?
The data file 'debug.txt' and hence the dataframe has 2 unique columns, so it's not a dataframe issue with duplicate column names per this post: Source: ValueError: cannot reindex from a duplicate axis using isin with pandas
----- debug.txt -----
PKey, Comments
100,Bad damaged product need refund.
200,second item
300,a third item goes here
400,Outlier text
500,second item
600,item
700,item
My code is below. Any help you can provide solving the ValueError above would be appreciated.
import re
import pandas as pd
# Define params used below
fileHeader = True
dictB = {}
inputFile = open("debug.txt", 'r')
if fileHeader == True:
inputFile.readline()
for line in inputFile:
inputText = line.split(",")
primaryKey = inputText[0]
inputTexttoAnalyze = inputText[1]
# Clean inputTexttoAnalyze and do other things...
# NOTE: Very inefficient to add 1 row at a time to a Pandas DF.
# They suggest combining the data in some other variable (like my dictionary)
# then copy it to the DF.
# Source: https://stackoverflow.com/questions/10715965/add-one-row-in-a-pandas-dataframe
dictB[primaryKey] = inputTexttoAnalyze
inputFile.close()
# Below is a List of words that must produce an EXACT match to a *substring* within
# the data frame Comments column.
findList = ["damaged product", "item"]
print("\nResults should ONLY have", findList, "\n")
dfB = pd.DataFrame.from_dict(dictB, orient='index').reset_index()
dfB.rename(columns={'index': 'PKey', 0: 'Comments'}, inplace=True)
for entry in findList:
rgx = '({})'.format("".join(r'(\b%s\b)' % entry))
# The following line gives the error: ValueError: cannot reindex from a duplicate axis.
# I DO have expected duplicate values in my input file.
resultDFb = dfB.set_index('Comments').filter(regex=rgx, axis=0)
for key in resultDFb['PKey']:
print(entry, key)
# This SO post says to run .index.duplicated() to see duplicated results, but I # don't see any, which is odd since there ARE duplicate results.
# https://stackoverflow.com/questions/38250626/valueerror-cannot-reindex-from-a-duplicate-axis-pandas
print(dfB.index.duplicated())