I have a csv file (data.csv):
data
cn=Clark Kent,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Peter Parker,ou=News,ou=news,ou=employee,dc=company,dc=com
cn=Mary Jane,ou=News_HQ,ou=news,ou=employee,dc=company,dc=com
cn=Oliver Twist,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Mary Poppins,ou=Ice Cream,ou=ice cream,dc=company,dc=com
cn=David Tenant,ou=userMger,ou=ice cream,ou=employee,dc=company,dc=com
cn=Pepper Jack,ou=users,ou=store,ou=employee,dc=company,dc=com
cn=Eren Jaeger,ou=Store,ou=store,ou=employee,dc=company,dc=com
cn=Monty Python,ou=users,ou=store,dc=company,dc=com
cn=John Smith,ou=userMger,ou=store,ou=employee,dc=company,dc=com
cn=Anne Potts,ou=Sprinkles_HQ,ou=sprinkles,dc=company,dc=com
cn=Harry Styles,OU=Sprinkles,ou=sprinkles,ou=employee,dc=company,dc=com
cn=James Bond,ou=Sprinkles_HQ,ou=employee,dc=company,dc=com
cn=Harry Potter,ou=users,ou=sprinkles,ou=employee,dc=company,dc=com
I need to parse the data to the point where I can get a count of how many rows have the same name in ou
. So for example, if there is Sprinkles_HQ
, Sprinkles
, or sprinkles
, they should count as the same. And if a row has Sprinkles_HQ
and sprinkles
(two of the same name), the row should still count as one (not two).
My desired output is something similar to this:
News, 4
Ice Cream, 2
Store, 4
Sprinkles, 4
The first steps I took was to read my csv file, the convert to my csv into a dataframe. I did this using Pandas:
#open file
file = open(directory)
#read csv and the column I want
df = pd.read_csv(file, usecols=['data'])
#make into a dataframe
rowData = pd.DataFrame(df)
Then in order to make parsing my data easier for me, I separated each row into comma separated values. Then converted those values into a list of list (each row is a list). Then remove any None values. Then I need to move all the data that starts with 'OU=' into its own list, and if any of the data has 'user' or 'userMger' or 'employee', I will remove those values from the list. This is my code as of now:
#splits the dataframe into comma separate values
lines =rowData['data'].str.split(",", expand=True)
#makes dataframe into a list of lists
a = lines.values.tolist()
#make my list of lists into a single list
employeeList = []
for i in range(len(a)):
for j in range(len(a[0])):
#there are some None values once converted to a list
if a[i][j] != None:
employeeList.append(a[i][j])
#list for storing only OUs
ouList = []
#moving the items to the ouList that are only OUs
for i in range(len(employeeList)):
if employeeList[i].startswith('OU='):
ouList.append(employeeList[i])
#need to iterate in reverse as I am removing items from the list
#here I remove the other items
for i in reversed(range(len(ouList))):
if ouList[i].endswith('users') or ouList[i].endswith('userMger') or ouList[i].endswith('employee'):
ouList.remove(ouList[i])
#my list now only contains specific OUs
print(ouList)
I believe I'm on the right track, my code is not removing any duplicates yet in each list within the list, such as Sprinkles_HQ
, Sprinkles
, or sprinkles
. Before I make my employeelist
list, I need to find a way to remove duplicates, and append them to a new list. This will make it easier for me to count.
I've researched how to remove duplicates in a list of lists. I attempted using something along the lines as:
new_list = []
for elem in a:
if a not in new_list:
new_list.append(elem)
But this doesn't take into consideration the words that start the same.
I attempted using startswith
and .lower()
as there are upper and lower cases, but does not work for me yet:
new_list=[]
for i in range(len(a)):
for j in range(len(a[0])):
if a[i][j].lower().startswith(a[i][j].lower()) not in new_list:
new_list.append(a[i][j])
Any suggestions, will be greatly appreciated.