
I am new to Python and I'm learning by doing.

At the moment, my code runs quite slowly, and it seems to take longer each time I run it.

The idea is to download an employee list as a CSV, then check the location of each Employee ID by running it through a specific page and writing the result to an Excel file.

We have around 600 associates on site each day, and I need to find their locations and keep refreshing them every 2-4 minutes.

EDIT:

For everyone to have a better understanding: I have a CSV file (TOT.csv) that contains the Employee IDs, names and other information of the associates I have on site.

In order to get their locations, I need to run each employee ID from that CSV file through https://guided-coaching-dub.corp.amazon.com/api/employee-location-svc/GetLastSeenLocationOfEmployee?employeeId= one by one, and at the same time write the result to another CSV file (Location.csv). Right now it takes about 10 minutes, and I want to understand whether the way I did it is the best possible way, or if there is something else I could try.

My code looks like this:

# GET EMPLOYEE ID FROM THE CSV

import csv
import time

from pandas import read_csv
from selenium.webdriver.common.by import By

# `driver` is assumed to be a Selenium WebDriver created earlier in the script

data = read_csv("Z:\\_Tracker\\Dump\\attendance\\TOT.csv")

# converting column data to list
TOT_employeeID = data['Employee ID'].tolist()


# Clean the Location Sheet


with open("Z:\\_Tracker\\Dump\\attendance\\Location.csv", "w") as f:
    pass

print("Previous Location data cleared ... ")


# go through EACH employee ID to find out location


for x in TOT_employeeID:
    driver.get(
        "https://guided-coaching-dub.corp.amazon.com/api/employee-location-svc/GetLastSeenLocationOfEmployee?employeeId=" + x)
    print("Getting Location data for EmployeeID: " + x)
    locData = driver.find_element(By.TAG_NAME, 'body').text
    aaData = str(locData)
    realLoc = aaData.split('"')

    # write to excel
    with open("Z:\\_Tracker\\Dump\\attendance\\Location.csv",
              "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(realLoc)

time.sleep(5)
print("Employee Location data downloaded...")

Is there a way I can do this faster?

Thank you in advance!

Regards, Alex

  • One thing: change reportFormat=CSV in the FCLM URL. That way you can use requests, which will be a lot faster than waiting for the JS to load. – Jason Baker Oct 12 '22 at 02:46
  • Please could you run some more preliminary analysis to understand the time spent in each sub-task? That way you could help others to help you, by making the question more specific. Sharing the whole script is less helpful, especially since other people can't run it unless they have the same packages installed and the same web/local file access as you do. – Frederick Zhang Oct 12 '22 at 02:54
  • @FrederickZhang edited it, I hope you get a better understanding of what I'm trying to get. – Alexandru Tudorie Oct 12 '22 at 03:16
  • @JasonBaker by changing the HTML to CSV, I get a 404 unauthorized error. It would've been a good idea, thank you anyway! – Alexandru Tudorie Oct 12 '22 at 03:47
  • You have to use Kerberos. Search inside Amazon or the wiki for jabaker fclm python. I left lots of wiki posts when I worked there. – Jason Baker Oct 12 '22 at 03:55
  • Hi @JasonBaker! Found your work and it's quite impressive! Rather than changing the way I'm scraping the data from FCLM, I'd like to find a solution to write more than one location at a time to the Excel sheet. I was thinking about multithreading, but I don't know yet how to do that or whether it's possible. – Alexandru Tudorie Oct 12 '22 at 21:56
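Jason Baker's suggestion above, fetching the endpoint directly instead of driving a browser, could be sketched with the standard library like this. The URL is the one from the question; authentication (e.g. Kerberos, as he notes) is not handled here and would be needed in practice, and the helper names are illustrative:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint from the question; auth is NOT handled in this sketch.
BASE_URL = ("https://guided-coaching-dub.corp.amazon.com/api/"
            "employee-location-svc/GetLastSeenLocationOfEmployee")


def parse_location(body_text):
    # Mirror the question's parsing: split the raw response body on double quotes.
    return str(body_text).split('"')


def fetch_location(employee_id):
    # One lightweight HTTP GET instead of a full browser page load.
    url = BASE_URL + "?" + urlencode({"employeeId": employee_id})
    with urlopen(url, timeout=10) as resp:
        return parse_location(resp.read().decode("utf-8"))
```

Each call here is a plain HTTP request with no JavaScript rendering to wait for, so `fetch_location` could be dropped into the existing loop in place of the `driver.get(...)` block.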

2 Answers


You could try separating the step of reading the information from the step of writing it to your CSV file, like below:

# Get Employee Location Information
# Create list for employee information, to be used below
employee_Locations = []
 
for x in TOT_employeeID:
    driver.get("https://guided-coaching-dub.corp.amazon.com/api/employee-location-svc/GetLastSeenLocationOfEmployee?employeeId=" + x)
    print("Getting Location data for EmployeeID: " + x)
    locData = driver.find_element(By.TAG_NAME, 'body').text
    aaData = str(locData)
    realLoc = [aaData.split('"')]
    employee_Locations.extend(realLoc)
            
# Write to excel - Try this as a separate step
with open("Z:\\_Tracker\\Dump\\attendance\\Location.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(employee_Locations)
            
print("Employee Location data downloaded...")

You may see some performance gains by collecting all your information first, then writing to your CSV file.

jrynes
  • Hi @jrynes! Thank you for your answer. This hasn't made any change to the time it takes to write all the data. It's just 300 KB in size, but still, 10 minutes is quite a lot of time for that. At this point, I'm out of ideas. – Alexandru Tudorie Oct 12 '22 at 21:58
  • You may be able to try creating multiple threads with Selenium, as described here: https://stackoverflow.com/questions/68988489/how-to-run-selenium-chromedriver-in-multiple-threads In short, collecting your online information in a multithreaded way may allow you to speed up that step. As Jason Baker commented above, using a separate library like pandas and a pandas DataFrame may help a bit, but it sounds like there may be another section of your code that is slow. Do you have any data available on how long it takes to run each step in your code (read, write, etc.)? – jrynes Oct 17 '22 at 06:31

Something like this:

import concurrent.futures
import csv

import pandas as pd
from pandas import read_csv
from selenium.webdriver.common.by import By

# `driver` is assumed to be a Selenium WebDriver created elsewhere in the script


def process_data(data: pd.DataFrame) -> None:
    associates = data['Employee ID'].unique()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(get_location, associates)


def get_location(associate: str) -> None:
    driver.get(
        "https://guided-coaching-dub.corp.amazon.com/api/employee-location-svc/GetLastSeenLocationOfEmployee?"
        f"employeeId={associate}")
    print(f"Getting Location data for EmployeeID: {associate}")
    realLoc = str(driver.find_element(By.TAG_NAME, 'body').text).split('"')

    with open("Z:\\_Tracker\\Dump\\attendance\\Location.csv", "a") as f:
        writer = csv.writer(f)
        writer.writerow(realLoc)


if __name__ == "__main__":
    data = read_csv("Z:\\_Tracker\\Dump\\attendance\\TOT.csv")
    process_data(data)
Jason Baker
  • Thank you for your response! For whatever reason, this does not write anything to Location.csv – Alexandru Tudorie Oct 12 '22 at 23:49
  • Did you call the function and pass the dataframe as the argument? process_data() – Jason Baker Oct 13 '22 at 00:09
  • I am confused about the argument, can you walk me through it please? – Alexandru Tudorie Oct 13 '22 at 00:39
  • data = read_csv("Z:\\_Tracker\\Dump\\attendance\\TOT.csv") is the argument. Updated above ^ Not sure how writing to the CSV file will react to multiprocessing appends, though. – Jason Baker Oct 13 '22 at 00:45
  • RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase. This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module: if __name__ == '__main__': freeze_support() ... The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable. – Alexandru Tudorie Oct 13 '22 at 00:58
  • It does everything I have in the script 5 to 10 times; once it gets to the Excel part, it crashes. – Alexandru Tudorie Oct 13 '22 at 00:58
  • Yeah, I thought writing to the CSV might not like it. You'll have to do as @jrynes suggested and only write to the CSV once. So have multiprocessing accumulate all the data, then write to Excel. – Jason Baker Oct 13 '22 at 01:04
  • That way, or the way I have it by default, would take the same amount of time, unfortunately. Is there any way we can chat privately? I have a question about one of your scripts that you have on the Wiki. – Alexandru Tudorie Oct 13 '22 at 01:22
  • yep send me a chat – Jason Baker Oct 13 '22 at 01:29
  • Don't have access to chat here as I do not have 20 reputation. – Alexandru Tudorie Oct 13 '22 at 01:33
  • https://chat.stackoverflow.com/rooms/248774/room-for-jason-baker-and-frederick-zhang – Jason Baker Oct 13 '22 at 01:35
  • Yeah, I have access to read your messages but not to write, as I'm under 20 points of reputation, unfortunately. Any other options? – Alexandru Tudorie Oct 13 '22 at 01:41
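Pulling the thread's conclusions together — fetch the locations concurrently with threads (which, unlike `ProcessPoolExecutor`, avoid the bootstrapping error quoted above and can share state), accumulate the rows in memory, and write the CSV once at the end — a minimal sketch might look like this. `fetch_location` here is a hypothetical stub standing in for the real Selenium or HTTP lookup:

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def fetch_location(employee_id):
    # Stand-in for the real lookup (the Selenium call or an HTTP request);
    # it fabricates a row so the overall pattern is runnable on its own.
    return [employee_id, "Area-" + employee_id[-1]]


def collect_locations(employee_ids):
    # Threads fetch concurrently; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch_location, employee_ids))


def write_locations(rows, f):
    # A single write from one thread avoids concurrent appends to the file.
    csv.writer(f).writerows(rows)


buf = io.StringIO()  # a real run would open Location.csv with newline=""
write_locations(collect_locations(["101", "102", "103"]), buf)
```

With real lookups, threads like these typically overlap the network waits (the work is I/O-bound), and writing once from the main thread sidesteps the crash seen when multiple processes appended to Location.csv at the same time.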