0

I need to match an EXACT substring in a pandas text column. However, when that dataframe text column has duplicate entries I get: ValueError: cannot reindex from a duplicate axis.

I reviewed the following post to determine how to query rows, but it was mostly about matching a whole entry rather than a substring. Select rows from a DataFrame based on values in a column in pandas

The following post showed how to find a substring using a regex pattern, which is EXACTLY what I need to look for regex word boundaries and what I am using below. How to filter rows containing a string pattern from a Pandas dataframe

I was able to get the code from the 2nd SO post above working except where I have duplicates in my Comments column. Note, entries 600 and 700 in the debug.txt file below are dupes which is fine. These dupes are expected, so how do I accommodate them?

The data file 'debug.txt' and hence the dataframe has 2 unique columns, so it's not a dataframe issue with duplicate column names per this post: Source: ValueError: cannot reindex from a duplicate axis using isin with pandas

----- debug.txt -----

PKey, Comments
100,Bad damaged product need refund.
200,second item
300,a third item goes here
400,Outlier text
500,second item
600,item
700,item

My code is below. Any help you can provide solving the ValueError above would be appreciated.

import re
import pandas as pd

# Define params used below
fileHeader = True

dictB = {}

inputFile = open("debug.txt", 'r')

if fileHeader == True:
    inputFile.readline()

for line in inputFile:

    inputText = line.split(",")

    primaryKey = inputText[0]
    inputTexttoAnalyze = inputText[1]

    # Clean inputTexttoAnalyze and do other things...

    # NOTE: Very inefficient to add 1 row at a time to a Pandas DF. 
    # They suggest combining the data in some other variable (like my dictionary)
    # then copy it to the DF. 
    # Source: https://stackoverflow.com/questions/10715965/add-one-row-in-a-pandas-dataframe

    dictB[primaryKey] = inputTexttoAnalyze

inputFile.close()

# Below is a List of words that must produce an EXACT match to a *substring* within 
# the data frame Comments column. 
findList = ["damaged product", "item"]

print("\nResults should ONLY have", findList, "\n")


dfB = pd.DataFrame.from_dict(dictB, orient='index').reset_index()
dfB.rename(columns={'index': 'PKey', 0: 'Comments'}, inplace=True)

for entry in findList:
    rgx = '({})'.format("".join(r'(\b%s\b)' % entry))

    # The following line gives the error: ValueError: cannot reindex from a duplicate axis. 
    # I DO have expected duplicate values in my input file.
    resultDFb = dfB.set_index('Comments').filter(regex=rgx, axis=0)
    for key in resultDFb['PKey']:
        print(entry, key)

# This SO post says to run .index.duplicated() to see duplicated results, but I # don't see any, which is odd since there ARE duplicate results.  
# https://stackoverflow.com/questions/38250626/valueerror-cannot-reindex-from-a-duplicate-axis-pandas

print(dfB.index.duplicated())

RandomTask
  • 499
  • 8
  • 19
  • 1
    the reason for the error is likely related to duplicate index items. i.e. the column Comments which you set as index has duplicates. and hence the error. I would say that filter can be safely used when the indexes(columns or index keys) are guaranteed to be unique. otherwise. df[series.str.contains(pattern)] is a safer option – sgDysregulation Feb 05 '18 at 08:16
  • The code below gave me the results I need, however I'm curious about your proposal as well. I tried to apply it to my data frame in the code above --> dfB[series.str.contains(pattern)] but received an error "NameError: name 'series' is not defined". What goes in the "series"? – RandomTask Feb 07 '18 at 03:47

1 Answers1

1

One problem I see is that the header for Comments has a leading space in it (", Comment") which could be causing the problem in the DataFrame.

If I understood you correctly, you are trying to identify all the rows in the DataFrame where the Comments contain one of the values in findList

The following may work for you (after you remove the leading space from the Comments header).

import pandas as pd
import re

def check(s):
    for item in findList:
        if re.search(r'\b' + item + r'\b', s):
            return True
    return False


findList = ["damaged prod", "item"]

df = pd.read_csv("debug.txt")

df[df.Comments.apply(check)]

Out[9]: 
   PKey                          Comments
1   200                       second item
2   300            a third item goes here
4   500                       second item
5   600                              item
6   700                              item

Hope that helps.

PaW
  • 659
  • 4
  • 7
  • Using a regular expression will allow you to perform an exact match. I have edited the original suggestion and included the regular expression. – PaW Feb 10 '18 at 05:24