2

So I have an SFTP server that hosts a single CSV file that contains data about multiple courses. The data is in the following format (4 columns):

Activity Name,Activity Code,Completion Status,Full Name
Safety with Lasers, 3XX1, 10-Jul-20, "Person, Name"
Safety with Lasers, 3XX1, NaN, "OtherP, OtherName"
How to use wrench, 7NPA, 10-Aug-19, "OtherName, Person"
etc...

I am using Paramiko to access the file using the following code:

file = sftp.open('Data.csv')

But the issue I am having is that it is a SFTPFile type. How can I go about parsing the data from it? I need to extract the names of the courses, and keep track of how many people have completed it and not completed it. I am using the following code at the moment but it is horrendously slow. Any suggestions would be appreciated:

Courses = ['']
Total =[0]
Compl =[0]
csvreal = pandas.read_csv(file)
for index, row in csvreal.iterrows():
    string =(csvreal.loc[[index]].to_string(index=False, header=False))
    if(Courses[i] !=string.split('  ')[0]):
        i+=1
        Courses.append(string.split('  ')[0])
        Total.append(0)
        Compl.append(0)
    if(len(string.split('  ')[2])>3):  #Note that incomplete courses do not have completion date, so it is NaN
        Compl[i]+=1
    Total[i]+=1

I know it is very terrible, I'm new and have no idea what I am doing. Any advice on where to read up on relevant documentation or tutorials would be appreciated. Thank you!

Danny_ds
  • 11,201
  • 1
  • 24
  • 46
macosx136
  • 31
  • 2

2 Answers2

1

sftp.open opens the file on the remote server, so every read will take place over the network. This network traversal is very slow compared to reading from local disk. It would be more efficient to copy the file to your local machine using sftp.get, and then it can be read without incurring the overhead of traversing the network. If you need to update the file you can update the local copy and then copy back to the server with sftp.put.

The code would be something like this (untested, as I don't have an ftp server to hand):

# Retrieve a copy and open
myfile = sftp.get('Data.csv', 'local-copy-Data.csv')
csvreal = pandas.read_csv(myfile)

# Update remote
sftp.put('local-copy-Data.csv', 'Data.csv')
snakecharmerb
  • 47,570
  • 11
  • 100
  • 153
  • 1
    While true in general, Paramiko can buffer the reading. See [Reading file opened with Python Paramiko SFTPClient.open method is slow](https://stackoverflow.com/q/58433996/850848). – Martin Prikryl Apr 13 '20 at 10:20
  • @MartinPrikryl thanks, I didn't know that. I'd be happy to mark this as a duplicate of your answer, if the question asker agrees that your answer solves their issue. – snakecharmerb Apr 13 '20 at 10:28
-1

If you don't have a c complier installed you'll need one to do this. Tried to use some Cython to make things a little quicker:

import pandas as pd
import time

start_time = time.time()

csvreal = pd.read_csv("Data.csv")
Courses = ['']
cdef int Total[0]
cdef int Compl[0]

cdef int i = 0
for index, row in csvreal.iterrows():
    string =(csvreal.loc[[index]].to_string(index=False, header=False))
    if(Courses[i] !=string.split('  ')[0]):
        i+=1
        Courses.append(string.split('  ')[0])
        Total.append(0)
        Compl.append(0)
    if(len(string.split('  ')[2])>3):  #Note that incomplete courses do not have completion date, so it is NaN
        Compl[i]+=1
    Total[i]+=1

print("--- %s seconds ---" % (time.time() - start_time))

You'll need to install Cython with pip install Python and compile the new file with a setup.py file in the same directory. The link to the tutorial to do all of this is here: Cython Documentation

Hope this helps!

  • how does this make it faster? – macosx136 Apr 12 '20 at 02:02
  • Python is a dynamic language. It has to check for the type of every variable setevery time the variable is called. In this case the variables i, Total, and Compl are checked every time they are called in the for loop. Cython allows the python code to be complied using the C language. This allows for type setting (casting) of the variables. Python then does not have to check the types of all the variables as it runs through the for loop. I'm new to doing this and only got a speed up of 0.004 seconds. This will add up if you are running the loop more. – Globe-Eater Apr 13 '20 at 14:56