4

I'm trying to split a 100 MB text file (with unique rows) into 10 files of equal size using Python and pysftp, but I can't find a suitable approach.

Please let me know how I can read and split a file from an SFTP directory and write all the resulting files back to the same SFTP directory.

import pysftp

with pysftp.Connection(host=sftphostname, username=sftpusername, port=sftpport, private_key=sftpkeypath) as sftp:
    with sftp.open(source_filedir + source_filename) as file:
        for line in file:
            ...  # <unable to decide the splitting logic>
Martin Prikryl
mu shaikh

3 Answers

2

The logic you probably need is as follows:

  1. As you are in a read-only environment, you will need to download the whole file into memory.

  2. Use Python's io.StringIO() to handle the data in memory as if it is a file.

  3. As you are talking about rows, I assume you mean the file is in CSV format? You can make use of Python's csv library to parse the file.

  4. First do a quick scan of the file using a csv.reader(), use this to count the number of rows in the file. This can then be used to determine how to split the file into equal number of rows, rather than just splitting the file at set byte counts.

  5. Once you know the number of rows, reopen the data (as a file again) and read in just the header row. This can then be written as the first row of each split file you create.

  6. Now read in n rows (based on your total row count). Use a csv.writer() and another io.StringIO() to first write the header row and then write the split rows into memory. This buffer can then be uploaded with pysftp to a new file on the server, all without requiring access to an actual file system.

The result will be that each file will also have a valid header row.
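The steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested production solution: it assumes the file is UTF-8 CSV with a header row, it keeps all rows in a list instead of doing the two passes described (simpler, but it uses more memory), and the function names and the ".partN" output naming are hypothetical. The sftp argument is expected to be an open pysftp.Connection; only its getfo() and putfo() methods are used.

```python
import csv
import io


def split_csv_text(text, parts):
    """Split CSV text into up to `parts` chunks with a near-equal number of
    rows. Every chunk starts with a copy of the header row."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # Ceiling division, so that no more than `parts` chunks are produced.
    rows_per_part = max(1, -(-len(body) // parts))
    chunks = []
    for start in range(0, len(body), rows_per_part):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(body[start:start + rows_per_part])
        chunks.append(out.getvalue())
    return chunks


def split_remote_csv(sftp, remote_path, parts, encoding="utf-8"):
    """Download `remote_path` into memory, split it, and upload the parts
    next to the original. Nothing is written to the local file system."""
    buf = io.BytesIO()
    sftp.getfo(remote_path, buf)  # download straight into memory
    chunks = split_csv_text(buf.getvalue().decode(encoding), parts)
    for i, chunk in enumerate(chunks, start=1):
        # Hypothetical naming scheme: data.csv.part1, data.csv.part2, ...
        sftp.putfo(io.BytesIO(chunk.encode(encoding)), f"{remote_path}.part{i}")
    return len(chunks)
```

Inside the with pysftp.Connection(...) as sftp: block from the question, this would be called as split_remote_csv(sftp, source_filedir + source_filename, 10).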

Martin Evans
  • Thanks a lot, Martin. Yes, this helped me, and I'm able to write to SFTP from the OCI FaaS environment, which happens to be read-only. I have marked this as the answer. – mu shaikh Aug 11 '20 at 12:17
0

I don't think FTP / SFTP allow for something more clever than simply downloading the file. Meaning, you'd have to get the whole file, split it locally, then put the new files back.

For text file splitting logic I believe that this thread may be of use: Split large files using python

bensha
  • Thanks, Bensha, for your response. The problem is that I'm trying to run the Python code in Oracle Cloud Infrastructure Functions (FaaS), which is a read-only environment where I can't download the file from FTP to local storage. The only option I have is to split it at runtime after reading it from FTP. – mu shaikh Aug 11 '20 at 05:06
0

There is a library called filesplit that you can use to split files. Its functionality is similar to the Linux commands split and csplit.

For your case

split text file of size 100 MB into 10 files of equal size

you can use the bysize method:

import math
import os
from filesplit.split import Split

infile = source_filedir + source_filename
outdir = source_filedir
split = Split(infile, outdir)  # construct the splitter

file_size = os.path.getsize(infile)
desired_parts = 10
# bysize expects an integer byte count; round up so that at most 10 parts are produced
bytes_per_split = math.ceil(file_size / desired_parts)

split.bysize(bytes_per_split)

For a line-partitioned split use bylinecount:

from filesplit.split import Split

split = Split(infile, outdir)
split.bylinecount(1_000_000)  # one million lines per output file
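If the file cannot be written to a local disk (as in the read-only environment mentioned in the comments), a similar size-based split can be approximated in memory with only the standard library. The sketch below is not part of filesplit; the function name split_bytes_on_lines is hypothetical, and it assumes the whole file fits in memory as a bytes object. It cuts only at line ends, so the parts are roughly, not exactly, equal in size.

```python
def split_bytes_on_lines(data, parts):
    """Split `data` (bytes) into at most `parts` chunks of roughly equal
    size, cutting only at line boundaries so no line is broken."""
    target = -(-len(data) // parts)  # ceiling division: target bytes per chunk
    chunks, current, size = [], [], 0
    for line in data.splitlines(keepends=True):
        current.append(line)
        size += len(line)
        # Close the chunk once it reaches the target, but keep the last
        # chunk open so that it collects whatever remains.
        if size >= target and len(chunks) < parts - 1:
            chunks.append(b"".join(current))
            current, size = [], 0
    if current:
        chunks.append(b"".join(current))
    return chunks


if __name__ == "__main__":
    sample = b"aa\nbb\ncc\ndd\n"
    for number, chunk in enumerate(split_bytes_on_lines(sample, 2), start=1):
        print(number, chunk)
```

Joining the returned chunks in order gives back the original bytes, so nothing is lost or duplicated by the split.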


Bonus

Since Python 3.6 you can use underscores in numeric literals (see PEP 515), for example million = 1_000_000, to improve readability.

hc_dev