
I have a csv file (data.csv):

data
cn=Clark Kent,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Peter Parker,ou=News,ou=news,ou=employee,dc=company,dc=com
cn=Mary Jane,ou=News_HQ,ou=news,ou=employee,dc=company,dc=com
cn=Oliver Twist,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Mary Poppins,ou=Ice Cream,ou=ice cream,dc=company,dc=com
cn=David Tenant,ou=userMger,ou=ice cream,ou=employee,dc=company,dc=com
cn=Pepper Jack,ou=users,ou=store,ou=employee,dc=company,dc=com
cn=Eren Jaeger,ou=Store,ou=store,ou=employee,dc=company,dc=com
cn=Monty Python,ou=users,ou=store,dc=company,dc=com
cn=John Smith,ou=userMger,ou=store,ou=employee,dc=company,dc=com
cn=Anne Potts,ou=Sprinkles_HQ,ou=sprinkles,dc=company,dc=com
cn=Harry Styles,OU=Sprinkles,ou=sprinkles,ou=employee,dc=company,dc=com
cn=James Bond,ou=Sprinkles_HQ,ou=employee,dc=company,dc=com
cn=Harry Potter,ou=users,ou=sprinkles,ou=employee,dc=company,dc=com

I need to parse the data to the point where I can count how many rows contain the same ou name. For example, Sprinkles_HQ, Sprinkles, and sprinkles should all count as the same name. And if a row contains both Sprinkles_HQ and sprinkles (two of the same name), that row should still count once (not twice).

My desired output is something similar to this:

News, 4
Ice Cream, 2
Store, 4
Sprinkles, 4 

The first step I took was to read my csv file, then convert the csv into a DataFrame. I did this using Pandas:

import pandas as pd

#read the csv and the column I want
#(read_csv accepts the file path directly, so open() isn't needed)
df = pd.read_csv(directory, usecols=['data'])

#read_csv already returns a DataFrame
rowData = df

Then, to make parsing easier, I split each row on commas, converted the result into a list of lists (each row is a list), and removed any None values. Next I need to move all the values that start with 'OU=' into their own list, and if any of those values end with 'users', 'userMger', or 'employee', I will remove them from that list. This is my code as of now:

#split each row of the dataframe on commas
lines = rowData['data'].str.split(",", expand=True)

#make the dataframe into a list of lists
a = lines.values.tolist()

#make my list of lists into a single list
employeeList = []
for i in range(len(a)):
    for j in range(len(a[0])):
        #there are some None values once converted to a list
        if a[i][j] is not None:
            employeeList.append(a[i][j])

#list for storing only OUs
ouList = []

#moving the items to the ouList that are only OUs
for i in range(len(employeeList)):
    if employeeList[i].startswith('OU='):
        ouList.append(employeeList[i])

#need to iterate in reverse as I am removing items from the list
#here I remove the other items
for i in reversed(range(len(ouList))):
    if ouList[i].endswith('users') or ouList[i].endswith('userMger') or ouList[i].endswith('employee'):
        del ouList[i]

#my list now only contains specific OUs
print(ouList)

I believe I'm on the right track, but my code is not yet removing duplicates within each row's list, such as Sprinkles_HQ, Sprinkles, or sprinkles. Before I build my employeeList, I need to find a way to remove those duplicates and append the results to a new list. This will make counting easier.

I've researched how to remove duplicates in a list of lists. I attempted using something along the lines of:

new_list = []
for elem in a:
    if elem not in new_list:
        new_list.append(elem)

But this doesn't take into account words that start the same, such as Sprinkles and Sprinkles_HQ. I attempted using startswith and .lower(), since there is a mix of upper and lower case, but it does not work for me yet:

new_list=[]
for i in range(len(a)):
    for j in range(len(a[0])):
        if a[i][j].lower().startswith(a[i][j].lower()) not in new_list:
           new_list.append(a[i][j])

Any suggestions will be greatly appreciated.

noobCoder
  • see this guide: https://pandas.pydata.org/docs/user_guide/text.html – Michael Delgado Mar 19 '22 at 21:57
  • Does this answer your question? [Pandas, groupby and count](https://stackoverflow.com/questions/47320572/pandas-groupby-and-count) – Michael Delgado Mar 19 '22 at 21:59
  • @MichaelDelgado Hi Michael, this is a great suggestion. However, the `.groupby(['data'])['data'].count()` doesn't work well with my DataFrame. So I split the data into columns to see if this also works by running this: `lines =rowData['data'].str.split(",", expand=True)` and `linesDF = pd.DataFrame(lines)`. This also didn't work as I hoped. – noobCoder Mar 20 '22 at 04:00
  • @MichaelDelgado I have another question, is it possible to strip the ou's and remove all `cn=name` and `dc=company/com`? This might help do some cleanup to only work with data I need. – noobCoder Mar 20 '22 at 04:01
  • If you have the same fields in each row, I’d skip the first row and read the file as multiple columns with read_csv. Check out the [.str string methods](https://pandas.pydata.org/docs/user_guide/text.html#string-methods) for all kinds of string operations. – Michael Delgado Mar 20 '22 at 14:38
  • @MichaelDelgado Thank you, Michael. I made some edits in my question above, I believe I'm very close to my goal. Once I'm able to remove the duplicates, and run the rest of my program, I believe grouping and counting them shouldn't be a problem. – noobCoder Mar 20 '22 at 15:06
  • @MichaelDelgado Hi Michael, I was able to solve this problem. I have another question, can I only post the solution as an edit on my question or do I need to make an answer for other users to see how I solved it? – noobCoder Mar 21 '22 at 04:42
  • don't edit your question to include an answer. if you have an answer to your own question you can feel free to post it as an answer. see the [ask] and [answer] guides for more help. please do edit your post though to make sure the question actually matches what you're looking for. I think your question is quite likely to be closed as a duplicate - there are a ton of questions on SO asking how to count duplicates in a CSV file. – Michael Delgado Mar 21 '22 at 04:49
  • @MichaelDelgado Thank you, Michael. Yes I definitely need to edit my question. I can see how it might be labeled a duplicate. I'll take some time probably tomorrow to edit my question and then properly place my answer. Thank you again for your helpful comments. – noobCoder Mar 21 '22 at 04:55

1 Answer


The solution I came up with is in parts. My first issue was the casing: I need everything to be in lowercase. For this step I keep employeeList as a list of lists (one sublist per row, with the None values filtered out) rather than flattening it, so after building it I added this code:

for i in range(len(employeeList)):
    for j in range(len(employeeList[i])):
        employeeList[i][j] = employeeList[i][j].lower()

This makes everything in my employeeList lowercase.

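The same lowercasing can also be written as a nested list comprehension (an equivalent sketch, assuming employeeList is a list of lists of strings as above):

#lowercase every string in every row
employeeList = [[item.lower() for item in row] for row in employeeList]
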
Once I fixed that, I needed to change ouList from a single flat list into a list of lists as well, so that the ou= values from each row stay grouped together in ouList.

#list for storing only OUs
ouList = []

#moving the items to the ouList that are only OUs
for i in range(len(employeeList)):
    ouList.append([])
    for j in range(len(employeeList[i])):
        if employeeList[i][j].startswith('ou='):
            ouList[i].append(employeeList[i][j])

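As a side note, the range(len(...)) pattern used above can be replaced with enumerate, which yields the index and the row together (an equivalent sketch, same result):

#same loop using enumerate instead of range(len(...))
ouList = []
for i, row in enumerate(employeeList):
    ouList.append([])
    for item in row:
        if item.startswith('ou='):
            ouList[i].append(item)
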
Then I needed to remove any items that end with users, usermger, or employee (everything is lowercase at this point). I iterated in reverse and used .endswith() so that removing items does not skip any elements.

#need to iterate in reverse as I am removing items from the list
for i in reversed(range(len(ouList))):
    for j in reversed(range(len(ouList[i]))):
        if (ouList[i][j].endswith('users')
                or ouList[i][j].endswith('usermger')
                or ouList[i][j].endswith('employee')):
            del ouList[i][j]

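As an alternative to removing items in place while iterating in reverse, str.endswith accepts a tuple of suffixes, so each row can simply be rebuilt without the unwanted entries (an equivalent sketch):

#drop the structural OUs; endswith can take a tuple of suffixes
unwanted = ('users', 'usermger', 'employee')
ouList = [[item for item in row if not item.endswith(unwanted)] for row in ouList]
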
Then, to strip ou= and the _hq suffix, I used the re module (regular expressions) and appended the cleaned values to another list called ouListStrip:

import re

#stripping ou= and the _hq suffix
ouListStrip = []
for i in range(len(ouList)):
    ouListStrip.append([])
    for j in range(len(ouList[i])):
        ou = re.sub("ou=|_hq", "", ouList[i][j])
        ouListStrip[i].append(ou)

ouListStrip outputs this:

[['news'], ['news', 'news'], ['news', 'news'], ['news'], ['ice cream', 'ice cream'], ['ice cream'], ['store'], ['store', 'store'], ['store'], ['store'], ['sprinkles', 'sprinkles'], ['sprinkles', 'sprinkles'], ['sprinkles'], ['sprinkles']]

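A note on the pattern: re.sub("ou=|_hq", "", ...) removes the literal text ou= and _hq wherever they appear in the string. Anchoring the pattern makes the intent (strip a leading ou= and a trailing _hq) explicit, for example:

import re

#same idea with anchored patterns: strip a leading 'ou=' and a trailing '_hq'
print(re.sub(r"^ou=|_hq$", "", "ou=sprinkles_hq"))   #sprinkles
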
Now that I have only the OU names, I can work on removing duplicates within the sublists. I achieve this by using not in and appending the results, still as a list of lists.

no_repeats = []
for i in range(len(ouListStrip)):
    no_repeats.append([])
    for j in range(len(ouListStrip[i])):
        if ouListStrip[i][j] not in no_repeats[i]:
            no_repeats[i].append(ouListStrip[i][j])

no_repeats outputs this:

[['news'], ['news'], ['news'], ['news'], ['ice cream'], ['ice cream'], ['store'], ['store'], ['store'], ['store'], ['sprinkles'], ['sprinkles'], ['sprinkles'], ['sprinkles']]

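The per-row deduplication can also be done with dict.fromkeys, which drops repeats while keeping the original order (an equivalent sketch):

#dict.fromkeys keeps only the first occurrence of each value per row
no_repeats = [list(dict.fromkeys(row)) for row in ouListStrip]
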
Next, I combine my list of lists into a single flat list:

allOUs = []
for i in range(len(no_repeats)):
    for j in range(len(no_repeats[i])):
        allOUs.append(no_repeats[i][j])

allOUs outputs:

['news', 'news', 'news', 'news', 'ice cream', 'ice cream', 'store', 'store', 'store', 'store', 'sprinkles', 'sprinkles', 'sprinkles', 'sprinkles']

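Flattening a list of lists is also exactly what itertools.chain.from_iterable does (an equivalent sketch):

from itertools import chain

#flatten the list of lists into a single list
allOUs = list(chain.from_iterable(no_repeats))
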
Then I make this list into a dictionary and count the items within it using .count():

dict_of_counts = {item:allOUs.count(item) for item in allOUs}

Outputs:

{'news': 4, 'ice cream': 2, 'store': 4, 'sprinkles': 4}

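Note that allOUs.count(item) re-scans the whole list for every item, so the comprehension is quadratic; collections.Counter produces the same counts in a single pass (an equivalent sketch):

from collections import Counter

#count each OU name in one pass
dict_of_counts = dict(Counter(allOUs))
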
To make it visually similar to what I want:

for key, value in dict_of_counts.items():
    print(key,',',value)

Outputs:

news , 4
ice cream , 2
store , 4
sprinkles , 4
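
For comparison, a condensed pandas route along the lines suggested in the comments could look roughly like this. This is only a sketch: the helper name row_ous is made up for illustration, and 'data.csv' with its 'data' column come from the question.

import re
import pandas as pd

df = pd.read_csv('data.csv', usecols=['data'])

unwanted = {'users', 'usermger', 'employee'}

#row_ous is an illustrative helper, not part of the solution above
def row_ous(dn):
    #pull every ou= value from the row, lowercase it, skip the structural OUs,
    #drop a trailing _hq, and dedupe within the row
    ous = re.findall(r'ou=([^,]+)', dn.lower())
    return sorted({re.sub(r'_hq$', '', ou) for ou in ous if ou not in unwanted})

counts = df['data'].apply(row_ous).explode().value_counts()
print(counts)   #news 4, store 4, sprinkles 4, ice cream 2 (order may vary)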
noobCoder