1

I have this code that works great and does what I want, however it does it in linear form which is way to slow for the size of my data files so I want to convert it to Log. I tried this code and many others posted here but still no luck at getting it to work. I will post both sets of code and give examples of what I expect.

import pandas
import fileinput

'''This code runs fine and does what I expect removing duplicates from big 
file that are in small file, however it is a linear function.'''

with open('small.txt') as fin:
    exclude = set(line.rstrip() for line in fin)
    for line in fileinput.input('big.txt', inplace=True):
        if line.rstrip() not in exclude:
            print(line, end='')
        else:
            print('')

'''This code is my attempt at conversion to a log function.'''


def log_search(small, big):
    first = 0
    last = len(big.txt) - 1
    while first <= last:
        mid = (first + last) / 2
        if str(mid) == small.txt:
            return True
        elif small.txt < str(mid):
            last = mid - 1
        else:
            first = mid + 1
    with open('small.txt') as fin:
        exclude = set(line.rstrip() for line in fin)
        for line in fileinput.input('big.txt', inplace=True):
            if line.rstrip() not in exclude:
                print(line, end='')
            else:
                print('')
            return log_search(small, big)
  1. big file has millions of lines of int data.
  2. small file has hundreds of lines of int data.
  3. compare data and remove duplicated data in big file but leave line number blank.

running the first block of code works but it takes too long to search through the big file. Maybe I am approaching the problem in a wrong way. My attempt at converting it to log runs without error but does nothing.

cyberzyme
  • 13
  • 6
  • Not really clear. Do you want convert lots of numbers to binary, or do you want to perform a binary search to find matches? For the latter, try the `bisect` module. – tobias_k Feb 25 '19 at 15:46
  • Isn't bisect for inserting data? What I want to do is use a logarithmic search that compares data in the small file to the data in the big file, removing that data from big file and replace it with a blank line. because the big file has millions of lines a linear search would take way to long. – cyberzyme Feb 25 '19 at 16:10
  • Or maybe I misunderstood you. Do you want to binary-search the lines from the small file inside the big file, i.e. convert O(b) to O(s logb) (s and b being the size of the small and big file respectively)? I don't think that this will work, as seeking the next line in the big file will probably still be O(b) unless you store it in a list first, which will also be O(b). – tobias_k Feb 25 '19 at 16:33
  • It seems like [`file.seek` is O(1)](https://stackoverflow.com/q/51801213/1639625) (at least on some systems?), but that will give you the ith character, not the ith line. You might still be able to use that to read the next line from that position and binary-search on the characters istead of the lines. Not sure about the overwriting-lines-with-blank-lines part, though. – tobias_k Feb 25 '19 at 16:36
  • Ok I see the confusion. big list is an ordered list of integer data. Look at my code. I want to take each item in the small list and compare it to the data at the mid point of the big list. If it is the same remove it leave blank line. If not same see if it is smaller or larger then eliminate looking in the 50% of the big file where it can not be, then do it again split remaining 50% and look at mid point. repeat until I find the duplicate remove it leave blank line. Then repeat that process for all the rest of the data in the small file until I remove all of the duplicates in the big file. – cyberzyme Feb 25 '19 at 16:55
  • This is a logarithmic search. although I wrote code to do that and it runs without a error it is not doing what i want. it is just running and doing nothing. However if I use the code at the top block that is the same as the code in the bottom block of the log code it runs and does what i want but not Log it runs linear and it does do what I want bit takes way to long, hence the need to make it a log function. – cyberzyme Feb 25 '19 at 17:00
  • So my problem is how to convert that linear function into a log function and have it do what I want. – cyberzyme Feb 25 '19 at 17:02
  • I know how binary search works, but your `binary_search` method is not valid Python code. What is `small.txt` and `big.txt` supposed to be, the current line, or the entire file? I think in order for the binary search to work, you will first have to load the entire big file into a random-access data structure, e.g. a list (unless you get that `file.seek` trick to work), and that alone will be O(b) – tobias_k Feb 25 '19 at 17:05
  • Sorry that is a typo. they are files that have the data and are just test files with 3 lines of data in the small file and 5 lines of data in the big file and 3 lines of that data in the big file is the same lines of data in the small file. So what should happen after running the code is the big file should still have 5 lines but the 3 lines that are duplicates will be blank lines. – cyberzyme Feb 25 '19 at 17:14
  • Oh and I should have told you while I have programmed before in the 70's and early 80's I have not coded in years and have just started to learn Python 2 weeks ago – cyberzyme Feb 25 '19 at 17:15
  • Can you please fix the code? The way it is now, it will certainly not "run without error". Also, is the `with open(...)` really supposed to be _inside_ the `log_search` function? Probably not... – tobias_k Feb 25 '19 at 19:09
  • That is one of my problems being new to python I don't know how to fix it. I tried many times with no success. If you can fix it I would be great full, Back in my day 70's and eairly 80's when I coded we were mostly using BASIC and some Fortran or Cobol, Dinosaur stuff. I know the principals and concepts are the same like loops, functions, keywords etc but "skinning the cat" is a whole new ball of wax for me. One that I have to relearn. – cyberzyme Feb 25 '19 at 20:43
  • I can't fix it if I don't know how it's supposed to look. Can you confirm that the `with` should be _outside_ of the function? And as I said, just fixing a regular binary search algorithm won't solve your problem as you still have to read the entire file into a list to read it. As I said, it might also work with `file.seek`, but I don't know how you'd then delete those lines. – tobias_k Feb 25 '19 at 20:48
  • ok let me work on that for a day and I will see what I can do. I will try and do my best by tomorrow late afternoon. I have other things I have to work on that is taking up a lot of my time. Thank you for the help – cyberzyme Feb 25 '19 at 21:16
  • Maybe this will help. Copy and past the code block at the top into your IDE and create two files.txt small and big with 8 lines of ints in the big file and 3 lines of ints in the small file. But have 3 of the lines in the big file match the 3 lines of ints in the small file. run the code and see what it does. That is what I want it to do. After that I just want to change that or rewrite the code so I can run it on very large files with millions of lines. It may be the case that I may not be doing it right in my methods and need to scrap my log code and start over. what do you think. – cyberzyme Feb 25 '19 at 21:33

1 Answers1

0

I don't think there is a better or faster way to do this that what you are currently doing in your first approach. (Update: There is, see below.) Storing the lines from small.txt in a set and iterating the lines in big.txt, checking whether they are in that set, will have complexity of O(b), with b being the number of lines in big.txt.

What you seem to be trying is to reduce this to O(s*logb), with s being the number of lines in small.txt, by using binary search to check for each line in small.txt whether it is in big.txt and removing/overwriting it then.

This would work well if all the lines were in a list with random access to any array, but you have just the file, which does not allow random access to any line. It does, however, allow random access to any character with file.seek, which (at least in some cases?) seems to be O(1). But then you will still have to find the previous line break to that position before you can actually read that line. Also, you can not just replace lines with empty lines, but you have to overwrite the number with the same number of characters, e.g. spaces.

So, yes, theoretically it can be done in O(s*logb), if you do the following:

  • implement binary search, searching not on the lines, but on the characters of the big file
    • for each position, backtrack to the last line break, then read the line to get the number
    • try again in the lower/upper half as usual with binary search
  • if the number is found, replace with as many spaces as there are digits in the number
  • repeat with the next number from the small file

On my system, reading and writing a file with 10 million lines of numbers only took 3 seconds each, or about 8 seconds with fileinput.input and print. Thus, IMHO, this is not really worth the effort, but of course this may depend on how often you have to do this operation.


Okay, so I got curious myself --and who needs a lunch break anyway?-- so I tried to implement this... and it works surprisingly well. This will find the given number in the file and replace it with an accordant number of - (not just a blank line, that's impossible without rewriting the entire file). Note that I did not thoroughly test the binary-search algorithm for edge cases, off-by-one erros etc.

import os

def getlineat(f, pos):
    pos = f.seek(pos)
    while pos > 0 and f.read(1) != "\n":
        pos = f.seek(pos-1)
    return pos+1 if pos > 0 else 0

def bsearch(f, num):
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)
        line = f.readline()
        if not line: break # end of file
        val = int(line)
        if val == num:
            return (pos, len(line.strip()))
        elif num < val:
            upper = mid - 1
        elif num > val:
            lower = mid + 1 
    return (-1, -1)

def overwrite(filename, to_remove):
    with open(filename, "r+") as f:
        positions = [bsearch(f, n) for n in to_remove]
        for n, (pos, length) in sorted(zip(to_remove, positions)):
            print(n, pos)
            if pos != -1:
                f.seek(pos)
                f.write("-" * length)

import random
to_remove = [random.randint(-500, 1500) for _ in range(10)]
overwrite("test.txt", to_remove)

This will first collect all the positions to be overwritten, and then do the actual overwriting in a second stes, otherwise the binary search will have problems when it hits one of the previously "removed" lines. I tested this with a file holding all the numbers from 0 to 1,000 in sorted order and a list of random numbers (both in- and out-of-bounds) to be removed and it worked just fine.

Update: Also tested it with a file with random numbers from 0 to 100,000,000 in sorted order (944 MB) and overwriting 100 random numbers, and it finished immediately, so this should indeed be O(s*logb), at least on my system (the complexity of file.seek may depend on file system, file type, etc.).

The bsearch function could also be generalized to accept another parameter value_function instead of hardcoding val = int(line). Then it could be used for binary-searching in arbitrary files, e.g. huge dictionaries, gene databases, csv files, etc., as long as the lines are sorted by that same value function.

tobias_k
  • 81,265
  • 12
  • 120
  • 179
  • I don't know why but I am getting this error. line 24 elif num < val: ^ SyntaxError: invalid syntax – cyberzyme Feb 27 '19 at 01:28
  • I tried altering it but that error keeps popping up – cyberzyme Feb 27 '19 at 01:29
  • @cyberzyme There is certainly no syntax error in the code as provided. The `^` points to an error in the line above. Did you change anything there, remove the `,` or anything like that? – tobias_k Feb 27 '19 at 08:52
  • Ah Ha I did copy paste and I don't know what version of Python you use but I am using 3.7 and you are right in the line above I also got an error message telling me to remove redundant Parenthesis. When I did the error message went away and the code ran fine. Thank you for the help. – cyberzyme Feb 27 '19 at 13:06