I have a text file with, say, 100 lines. I want to randomly split these lines 80/20 into two separate text files. I am using the code below, but it does not partition properly: I get a different number of lines than expected. I should get 80 lines in file2 and 20 lines in file1.

Can someone point out the error and suggest a better way, if there is one? Note that total_list.txt is the original file, which needs to be split into file1 and file2.

import random

def partition(l, pred):
    # both files are closed automatically when the with block exits
    with open('meta/file1.txt', 'w') as fid_train, \
         open('meta/file2.txt', 'w') as fid_test:
        for e in l:
            if pred(e):
                fid_test.write(e)
            else:
                fid_train.write(e)

with open("meta/total_list.txt") as f:
    lines = f.readlines()
# each line independently goes to file2 with probability 0.2
partition(lines, lambda x: random.random() < 0.2)
Sam Mason

1 Answer

Given that you only have 100 lines and you apparently want an exact 80/20 split, I'd suggest just shuffling and writing out the number of lines you want. Something like:

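import random

# read everything in
with open("meta/total_list.txt") as f:
    lines = f.readlines()

# randomise order
random.shuffle(lines)

# split the list and write out according to desired proportions;
# the with blocks make sure each file is flushed and closed
with open('meta/file1.txt', 'w') as f:
    f.writelines(lines[:20])   # first 20 shuffled lines
with open('meta/file2.txt', 'w') as f:
    f.writelines(lines[20:])   # remaining 80 lines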

Note that many libraries provide similar functionality, e.g. scikit-learn provides train_test_split.

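Your original approach amounts to a binomial draw: each line independently goes to the smaller file with probability 0.2, so the count varies from run to run and would typically (about 95% of the time) be between 12 and 28 lines. You can calculate this analytically via: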

from scipy.stats import binom

binom.ppf([0.025, 0.975], 100, 0.2)
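
If scipy is not available, the same spread can be checked empirically with the standard library alone. The following is only a rough sketch: the helper name `count_selected`, the seed, and the number of repetitions are arbitrary choices for illustration, not part of the question's code.

```python
import random

def count_selected(n_lines=100, p=0.2, rng=random):
    """Count how many of n_lines pass an independent coin flip with probability p."""
    return sum(rng.random() < p for _ in range(n_lines))

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
counts = sorted(count_selected(rng=rng) for _ in range(10_000))

# empirical 2.5% and 97.5% quantiles of the simulated counts
low = counts[int(0.025 * len(counts))]
high = counts[int(0.975 * len(counts)) - 1]
print(low, high)
```

The two printed values should land close to the interval given by `binom.ppf`, which shows why the per-line coin flip rarely produces exactly 20 lines.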
Sam Mason
  • It's not working for me. I don't know why, but for text files it throws an error: 'list' object has no attribute 'shuffle'. total_list is a text file, so I am using `lines = open("meta/total_list.txt").readlines()` followed by `lines.shuffle()` – KRISHNA CHAUHAN Apr 29 '21 at 04:30
  • @KRISHNACHAUHAN sorry, I misremembered where the shuffle method lives, have updated the answer using the random module – Sam Mason Apr 29 '21 at 08:14
  • Thank you so much, sir @Sam Mason. I wish I could learn how to use train_test_split with text files in this situation. – KRISHNA CHAUHAN Apr 29 '21 at 16:15
  • @KRISHNACHAUHAN see https://stackoverflow.com/a/55442136/1358308 in your case you can just do: `file1, file2 = train_test_split(lines, test_size=0.2)` and you'll get lists of the appropriate length in `file1` and `file2` – Sam Mason Apr 29 '21 at 16:20
  • Sir, I have done this before, but my dataloader reads file locations only from these train/test text files, and here I am getting lists. Is there any way to write them into text files using this function? – KRISHNA CHAUHAN Apr 30 '21 at 05:32
  • erm, using `open(...).writelines(...)` as normal... – Sam Mason Apr 30 '21 at 08:45
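
To spell out the last comment: once you have two lists, however they were produced, writing them to text files could look roughly like the sketch below. The helper `write_lines`, the sample file names in the lists, and the temporary output directory are all made up for illustration; swap in your own lists and the `meta/` paths.

```python
import os
import tempfile

def write_lines(path, lines):
    """Write a list of strings to path, one item per line."""
    with open(path, "w") as fh:
        # readlines() keeps the trailing "\n"; add one only where it is missing
        fh.writelines(
            line if line.endswith("\n") else line + "\n" for line in lines
        )

# hypothetical stand-ins for the two lists returned by a split
train = ["a.wav\n", "b.wav\n", "c.wav\n", "d.wav\n"]
test = ["e.wav"]

out_dir = tempfile.mkdtemp()  # stand-in for the meta/ directory
train_path = os.path.join(out_dir, "file2.txt")
test_path = os.path.join(out_dir, "file1.txt")
write_lines(train_path, train)
write_lines(test_path, test)
```

The newline check matters mostly when the lists were built by hand or stripped along the way; lines that come straight from `readlines()` normally carry their own line endings already.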