
I have a text file, let's call it "input.txt", with roughly a billion lines of words. I'm trying to write a Python script that removes every duplicate line and then creates a new file, let's call it "output.txt", which should be a clean version of input.txt without any duplicates.

I thought of using Python because it feels light and fast, but I'm not sure it can handle that much data: about 1 billion lines of random words.

At the same time, I'm not sure whether I should be using something else entirely, or even how to start my Python script. So far I have this:

import os

with open("input.txt")

And I'm just stuck. I have never used Python for anything like this and I'm unsure how to continue.

An example of what the input looks like:

red
red1
red2
red
RED@
ReD
RED@
red
ReD

and so on... (random words, but case sensitive)

and the desired output should look like:

red
red1
red2
RED@
ReD

Any help is highly appreciated.

Thanks!

Nora
  • Do you know any other language? Use it. I love Python, but for something like this it will be painfully slow – bench Oct 08 '20 at 01:59
  • So is it unique per word, sentence etc? Can you give us an example of data before and after? – PacketLoss Oct 08 '20 at 01:59
  • Is it duplicate lines or duplicate words that you are trying to remove? – abRao Oct 08 '20 at 02:00
  • @bench: That's nonsense. This is an I/O bound problem; Python's processing being slower is largely irrelevant in that context. When I/O takes 100x longer than the theoretically optimal CPU processing written in some combination of C and assembly, Python taking 20x as long to do the CPU work doesn't matter; you're still only 20% slower than the theoretically optimal solution. – ShadowRanger Oct 08 '20 at 02:02
  • @PacketLoss Just updated the post! – Nora Oct 08 '20 at 02:02
  • @Nora I often find it is useful to consider naive approaches first on a small scale. What about using the `set` data structure, for example? Or what about sorting the input and looking for duplicate lines? Consider the runtime and storage complexity of the various solutions when you do so. Then, once you have something that works on a small scale, think again about how to scale and optimize for performance. – Tom Oct 08 '20 at 02:08 (a sketch of the set-based approach appears after these comments)
  • @Tom Totally agree, I'm currently trying to do this on a 10-line text file, just to make it work and then build from there. My biggest issue is that it seems really difficult because I've never used Python to manipulate data or anything of that sort. – Nora Oct 08 '20 at 02:10
  • If the order is not important, you can create 52 text files (outa.txt, outb.txt, and so on, one per letter of 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'), with each file holding the words that start with that letter, then de-duplicate each of the 52 files in memory and merge them – Aaj Kaal Oct 08 '20 at 02:25 (a sketch of this partitioning idea appears after these comments)
  • @Nora since this question is closed I can't provide a more thorough answer, but have a look at this simple two-liner: ` import sys; print(''.join(set(open(sys.argv[1]).readlines()))) ` This opens the file specified by argv[1], reads all of its lines into a list, passes that list to `set()` creating a `set` of unique lines, joins all of them back into a string, and prints the string. This probably wouldn't scale to billions of lines without a lot of memory. – Tom Oct 08 '20 at 15:54
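Here is a minimal sketch of the set-based approach mentioned in the comments: stream input.txt line by line, remember every line seen so far in a set, and write a line to output.txt only the first time it appears, which preserves the first-occurrence order shown in the question's example. The encoding argument and variable names are assumptions added for illustration, not part of the original post.

seen = set()

with open("input.txt", "r", encoding="utf-8") as infile, \
        open("output.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        word = line.rstrip("\n")   # comparison stays case sensitive, as in the example
        if word not in seen:       # first time this word appears
            seen.add(word)
            outfile.write(word + "\n")

Memory use grows with the number of unique lines rather than the total line count, so whether this fits in RAM depends on how many distinct words the billion lines actually contain.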
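If the set of unique lines is too large for memory and the output order does not matter, the partitioning idea from the comments could look roughly like this: split input.txt into one bucket file per first character, de-duplicate each bucket on its own with a set, and append the results to output.txt. The out_<ord>.txt naming scheme, the catch-all bucket for lines that don't start with a letter, and the clean-up step are assumptions added here for illustration.

import os
import string

# Pass 1: split input.txt into one bucket file per first letter so that only
# one bucket at a time has to be de-duplicated in memory later.
letters = string.ascii_lowercase + string.ascii_uppercase
bucket_files = {ch: open(f"out_{ord(ch)}.txt", "w", encoding="utf-8") for ch in letters}
other_file = open("out_other.txt", "w", encoding="utf-8")  # digits, punctuation, etc.

with open("input.txt", "r", encoding="utf-8") as infile:
    for line in infile:
        word = line.rstrip("\n")
        if word:
            bucket_files.get(word[0], other_file).write(word + "\n")

for handle in list(bucket_files.values()) + [other_file]:
    handle.close()

# Pass 2: de-duplicate each bucket with a set and append it to output.txt.
bucket_names = [f"out_{ord(ch)}.txt" for ch in letters] + ["out_other.txt"]
with open("output.txt", "w", encoding="utf-8") as outfile:
    for name in bucket_names:
        seen = set()
        with open(name, "r", encoding="utf-8") as bucket:
            for line in bucket:
                word = line.rstrip("\n")
                if word not in seen:
                    seen.add(word)
                    outfile.write(word + "\n")
        os.remove(name)  # remove the intermediate bucket file

With 52 buckets, each pass only needs to hold one bucket's unique words in memory; if a single bucket is still too big, the same trick can be applied again using the first two characters.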

0 Answers