3

I'm new to Python (I learned it through a Codecademy course) and could use some help figuring this out.

I have a file, 'TestingDeleteLines.txt', that's about 300 lines of text. Right now, I'm trying to get it to print me 10 random lines from that file, then delete those lines.

So if my file has 10 lines:

Carrot
Banana
Strawberry
Cantaloupe
Blueberry
Snacks
Apple
Raspberry
Papaya
Watermelon

I need it to randomly pick out from those lines, tell me it's randomly picked blueberry, carrot, watermelon, and banana, and then delete those lines.

The issue is, when Python reads a file, it reads that file and once it gets to the end, it won't go back and delete the lines. My current thinking was that I could write the lines to a list, then reopen the file, match the list to the text file, and if it finds a match, delete the lines.

My current problem is twofold:

  1. It's duplicating the random elements. If it picks a line, I need it to not pick that same line again. However, using random.sample doesn't seem to work, as I need those lines separated out when I later use each line to append to a URL.
  2. I don't feel like my logic (write to array->find matches in text file->delete) is the most ideal logic. Is there a better way to write this?

    import webbrowser
    import random

    # Eventually, I need a base URL + my 10 random lines opening in each new tab:
    # url = 'http://www.google.com'
    # webbrowser.open_new_tab(url + myline)

    def ShowMeTheRandoms():
        DeleteList = []
        lines = open('TestingDeleteLines.txt').read().splitlines()
        for x in range(0, 10):
            myline = random.choice(lines)  # may pick the same line twice
            print(myline)  # debugging, remove later
            DeleteList.append(myline)
            print(DeleteList)  # debugging, remove later

    ShowMeTheRandoms()
    
wjandrea
Sam W
  • The way to do this is to open the file, read in all the lines with `readlines()`, close the file, then rewrite the entire file. – Morgan Thrapp Sep 25 '15 at 18:21
  • How do I tell it to just delete the random lines though? – Sam W Sep 25 '15 at 18:31
  • [`file_object.seek(0)`](https://docs.python.org/3/library/io.html#io.TextIOBase.seek) should let you start iterating from the beginning again. In your example, `lines` looks like it is a *file_object*. – wwii Sep 25 '15 at 18:39

6 Answers

4

The point is: you don't "delete" from a file; you rewrite the whole file (or another one) with new content. The canonical way is to read the original file line by line, write the lines you want to keep to a temporary file, then replace the old file with the new one.

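import os

with open("/path/to/source.txt") as src, open("/path/to/temp.txt", "w") as dest:
    for line in src:
        if should_we_keep_this_line(line):  # your own predicate
            dest.write(line)
# Both files are closed here. On Windows, os.rename() fails if the
# destination exists; os.replace() (Python 3.3+) overwrites it instead.
os.rename("/path/to/temp.txt", "/path/to/source.txt")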
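Applied to the question, the read, filter, replace pattern above might look like the sketch below. The function name pop_random_lines is made up for illustration, and the sketch assumes it is acceptable to read the file twice (once to count lines, once to copy them). It returns the removed lines in file order, not in the order they were drawn.

```python
import os
import random
import tempfile


def pop_random_lines(path, k):
    """Remove up to k randomly chosen lines from the file at path and return them.

    Hypothetical helper, not part of any library.
    """
    # First pass: count the lines so we can pick distinct line numbers.
    with open(path) as src:
        total = sum(1 for _ in src)
    chosen = set(random.sample(range(total), min(k, total)))

    picked = []
    # Create the temp file next to the original so the final swap
    # stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with open(path) as src, os.fdopen(fd, "w") as dest:
            for number, line in enumerate(src):
                if number in chosen:
                    picked.append(line.rstrip("\n"))  # a line we remove
                else:
                    dest.write(line)  # a line we keep
        os.replace(tmp_path, path)  # swap the new file in for the old one
    except Exception:
        os.unlink(tmp_path)  # do not leave a stray temp file behind
        raise
    return picked
```

Because the line numbers come from random.sample, no line is picked twice, and each returned string can be appended to a URL on its own.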
bruno desthuilliers
  • So instead of writing the random lines to an array, I should write all the other non-random lines to the array and create a new file? – Sam W Sep 25 '15 at 18:47
  • Why use an array (FWIW, in Python it's a `list`, not an `array`) at all? Read a line from the source, decide if you want to keep it, and if so write it to the temp file; lather, rinse, repeat. – bruno desthuilliers Sep 26 '15 at 06:14
3

I have a file, 'TestingDeleteLines.txt', that's about 300 lines of text. Right now, I'm trying to get it to print me 10 random lines from that file, then delete those lines.

#!/usr/bin/env python
import random

k = 10
filename = 'TestingDeleteLines.txt'
with open(filename) as file:
    lines = file.read().splitlines()

if len(lines) > k:
    random_lines = random.sample(lines, k)
    print("\n".join(random_lines)) # print random lines

    with open(filename, 'w') as output_file:
        output_file.writelines(line + "\n"
                               for line in lines if line not in random_lines)
elif lines: # file is too small
    print("\n".join(lines)) # print all lines
    with open(filename, 'wb', 0): # empty the file
        pass

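This is an O(n*k) algorithm, because every kept-or-dropped test scans the list of k random lines (O(n**2) in the worst case, when k approaches n). It can be improved if necessary, though you don't need to for a tiny file such as your input.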
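One possible improvement, sketched under the assumption that the lines are already in a list: sample positions instead of text and keep them in a set, so each membership test is O(1) on average. This also behaves differently when the file contains repeated lines, since filtering by text drops every line equal to a picked one, while filtering by position drops only the picked ones. The name split_random is hypothetical.

```python
import random


def split_random(lines, k):
    """Return (picked, remaining), with up to k lines picked at random positions.

    Hypothetical helper; works on any list of strings.
    """
    k = min(k, len(lines))
    # A set of positions gives O(1) average-time membership tests.
    chosen = set(random.sample(range(len(lines)), k))
    picked = [line for i, line in enumerate(lines) if i in chosen]
    remaining = [line for i, line in enumerate(lines) if i not in chosen]
    return picked, remaining
```

To use it with the code above, pass the result of file.read().splitlines() as lines, print picked, and write remaining back to the file with a "\n" after each line.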

jfs
  • As a beginner coder, this was super easy to read and understand, so I thank you very much. :) Now, the issue I'm having is that it's throwing a syntax error on the elif line if I put it into a function. Do you have any ideas as to why that might be? – Sam W Sep 28 '15 at 22:07
  • @SamW: my guess, you broke the code indentation (make sure you don't mix tabs and spaces for indentation, use either or, not both) but I can't be sure if you don't show the *exact* code: [create a minimal but complete code example](http://stackoverflow.com/help/mcve) that demonstrates the issue and add it to your question (or ask a new one if you think the error might be interesting to somebody else). – jfs Sep 28 '15 at 22:20
  • Oh my god, duh! Thank you SO much, this was insanely helpful, and I learned a lot. :) I really sincerely appreciate you taking the time to write that out. – Sam W Sep 29 '15 at 15:59
3

To choose a random line from a file, you could use a space efficient single-pass reservoir-sampling algorithm. To delete that line, you could print everything except the chosen line:

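#!/usr/bin/env python3
import fileinput

filename = 'TestingDeleteLines.txt'
with open(filename) as file:
    # pick one (index, line) pair at random and keep only the index
    k = select_random_it(enumerate(file), default=[-1])[0]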

if k >= 0: # file is not empty
    with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
        for i, line in enumerate(file):
            if i != k: # keep line
                print(line, end='') # stdout is redirected to filename

where select_random_it() implements the reservoir-sampling algorithm:

import random

def select_random_it(iterator, default=None, randrange=random.randrange):
    """Return a random element from iterator.

    Return default if iterator is empty.
    iterator is exhausted.
    O(n)-time, O(1)-space algorithm.
    """
    # from https://stackoverflow.com/a/1456750/4279
    # select 1st item with probability 100% (if input is one item, return it)
    # select 2nd item with probability 50% (or 50% the selection stays the 1st)
    # select 3rd item with probability 33.(3)%
    # select nth item with probability 1/n
    selection = default
    for i, item in enumerate(iterator, start=1):
        if randrange(i) == 0: # random [0..i)
            selection = item
    return selection

To print k random lines from a file and delete them:

#!/usr/bin/env python3
import random
import sys

k = 10
filename = 'TestingDeleteLines.txt'
with open(filename) as file:
    random_lines = reservoir_sample(file, k) # get k random lines

if not random_lines: # file is empty
    sys.exit() # do nothing, exit immediately

print("\n".join(map(str.strip, random_lines))) # print random lines
delete_lines(filename, random_lines) # delete them from the file

where reservoir_sample() uses the same algorithm as select_random_it() but allows choosing k items instead of one:

import random
from itertools import islice

def reservoir_sample(iterable, k,
                     randrange=random.randrange, shuffle=random.shuffle):
    """Select *k* random elements from *iterable*.

    Use O(n) Algorithm R https://en.wikipedia.org/wiki/Reservoir_sampling

    If the number of items is less than *k*, return all items in random order.
    """
    it = iter(iterable)
    if not (k > 0):
        raise ValueError("sample size must be positive")

    sample = list(islice(it, k)) # fill the reservoir
    shuffle(sample)
    for i, item in enumerate(it, start=k+1):
        j = randrange(i) # random [0..i)
        if j < k:
            sample[j] = item # replace item with gradually decreasing probability
    return sample

and the delete_lines() utility function deletes the chosen random lines from the file:

import fileinput
import os

def delete_lines(filename, lines):
    """Delete *lines* from *filename*."""
    lines = set(lines) # for amortized O(1) lookup
    with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
        for line in file:
            if line not in lines:
                print(line, end='')
    os.unlink(filename + '.bak') # remove backup if there is no exception

The reservoir_sample() and delete_lines() functions do not load the whole file into memory, and therefore they can work on arbitrarily large files.

jfs
1

What about list.pop? It gives you the item and updates the list in one step.

import random

with open('TestingDeleteLines.txt') as f:
    lines = f.readlines()
deleted = []

indices_to_delete = random.sample(range(len(lines)), 10)

# sort to delete the biggest index first, so earlier pops
# do not shift the positions of the indices still to come
indices_to_delete.sort(reverse=True)

for i in indices_to_delete:
    # lines.pop(i) deletes the item at index i and returns it;
    # the index is kept too, in case you need its position in the original file
    deleted.append((i, lines.pop(i)))

# write the updated *lines* back to the file (or to a new file);
# everything that was removed is in *deleted* if you need it again
rebeling
  • My initial question wasn't as precise as maybe it should have been. I need it to randomly choose lines from a file, tell me what those lines say, and then delete the lines. – Sam W Sep 25 '15 at 20:56
  • @SamW The deleted lines are in the variable `deleted`, and the remaining lines are still in `lines`. What else do you need? – Brent Washburne Sep 25 '15 at 22:33
  • why do you need to sort indices here? (`lines.pop(i)` is `O(n)` either way) – jfs Sep 25 '15 at 23:28
  • @J.F.Sebastian just to prevent an IndexError: pop index out of range – rebeling Sep 26 '15 at 17:53
  • that makes sense. I might have been thinking about `for i in choices: items.remove(i)` from [@Josh Trii Johnston's answer](http://stackoverflow.com/a/32788895/4279) – jfs Sep 26 '15 at 17:58
1

Let's assume you have a list of lines from your file stored in items:

>>> import random
>>> items = ['a', 'b', 'c', 'd', 'e', 'f']
>>> choices = random.sample(items, 2)  # select 2 distinct items
>>> choices  # here are the two (yours will differ)
['b', 'c']
>>> for i in choices:
...   items.remove(i)
...
>>> items  # ta-da, no more b or c
['a', 'd', 'e', 'f']

From here you would overwrite your previous text file with the contents of items, joined with your preferred line ending (\r\n or \n). Note that readlines() does not strip line endings, so if you read the file with that method, you do not need to add your own.
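Putting those steps together for a file read with readlines() could look like this sketch. The function name remove_random_lines is made up for illustration, and the sketch assumes the whole file fits comfortably in memory.

```python
import random


def remove_random_lines(path, k):
    """Pick up to k random lines from the file, rewrite it without them,
    and return the picked lines without their line endings.

    Hypothetical helper, not part of any library.
    """
    with open(path) as f:
        items = f.readlines()  # each item keeps its trailing "\n"
    choices = random.sample(items, min(k, len(items)))
    for choice in choices:
        items.remove(choice)  # removes the first line equal to choice
    with open(path, "w") as f:
        f.writelines(items)  # no separator added: the endings are already there
    return [choice.rstrip("\n") for choice in choices]
```

writelines() writes the strings as they are, which is why no extra blank lines appear in the rewritten file.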

Josh J
  • "joining with your preferred line ending \r\n or \n" is wrong, because readlines list items contain newline at the end ...it would add extra blank lines – rebeling Sep 29 '15 at 10:17
  • @rebeling oversight on my part. I will edit accordingly. – Josh J Sep 29 '15 at 13:11
0

Maybe you could try generating 10 distinct random indices between 0 and len(lines) - 1 (0 to 299 for a 300-line file) using

deleteLineNums = random.sample(xrange(len(lines)), 10)

and then delete those entries from the lines list by building a filtered copy with a list comprehension:

linesCopy = [line for idx, line in enumerate(lines) if idx not in deleteLineNums]
lines[:] = linesCopy

Then write lines back to 'TestingDeleteLines.txt'.

To see why the copy code above works, this post might be helpful:

Remove items from a list while iterating

EDIT: To get the lines at the randomly generated indices, simply do:

actualLines = []
for n in deleteLineNums:
    actualLines.append(lines[n])

Then actualLines contains the actual line text at the randomly generated line indices.

EDIT: Or even better, use a list comprehension:

actualLines = [lines[n] for n in deleteLineNums]
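As a rough end-to-end sketch of this index-based approach: the indices are chosen first, the text at those indices is saved, and only then are the entries filtered out. The function name pick_and_delete, the sample data, and the base URL are placeholders for illustration, not anything from the original code.

```python
import random


def pick_and_delete(lines, count):
    """Choose count random indices; return (text at those indices, remaining lines).

    Hypothetical helper, not part of any library.
    """
    delete_line_nums = random.sample(range(len(lines)), count)
    actual_lines = [lines[n] for n in delete_line_nums]  # the text to use later
    doomed = set(delete_line_nums)  # set for fast "idx not in" checks
    remaining = [line for idx, line in enumerate(lines) if idx not in doomed]
    return actual_lines, remaining


# Sample data standing in for the contents of the file.
lines = ["Carrot", "Banana", "Strawberry", "Blueberry", "Apple", "Papaya"]
actual_lines, lines = pick_and_delete(lines, 3)

base_url = "http://www.example.com/"  # placeholder base URL
urls = [base_url + text for text in actual_lines]  # one URL per picked line
```

Each entry of urls could then be passed to webbrowser.open_new_tab(), and lines written back to the file.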
sgrg
  • How am I connecting that to my original random lines? 'for x in range(0,10): myline=random.choice(lines) print(myline)' So say that pulls out "carrots, banana, apple". I want to now delete those exact same lines. If I add deleteLineNums = random.sample(xrange(len(lines)), 10), that just gives me a list of numbers, but those numbers don't correspond with the random lines I pulled already. Am I misunderstanding something? – Sam W Sep 25 '15 at 20:36
  • So in this case you would be identifying random line indices to delete, not lines themselves. Note that since the lines are being selected at random in both cases, the two approaches are equivalent in identifying 10 random lines from the file. EDIT: (So you would be doing this in place of selecting the actual lines at random, and that replacement gives you logically the same result) Does that make sense? – sgrg Sep 25 '15 at 20:39
  • Ah, that does clarify things. The thing is, I need the actual text from the lines, because later I'm using the text from those lines to add to the end of a URL string. So I need to know what line is at that index, then delete it. – Sam W Sep 25 '15 at 20:46