
I'm working with huge CSV data files. Each file contains millions of records, each record has a key, and the records are sorted by their key. I don't want to scan the whole file when searching for certain data. I've seen this solution: Reading Huge File in Python

But it suggests that every line in the file have the same length, which is not the case for my data.

I thought about adding padding to each line to keep a fixed line length, but I'd like to know if there is a better way to do it.

I'm working with Python.
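
The padding idea would look roughly like this (just a sketch; pad_width and the file names are placeholders, and it assumes no line is longer than pad_width - 1 characters):

pad_width = 128  # fixed width for every line, including the newline

# one-off preprocessing: rewrite the file with fixed-width lines
with open('data.csv') as src, open('data_padded.csv', 'w') as dst:
    for line in src:
        dst.write(line.rstrip('\n').ljust(pad_width - 1) + '\n')

# afterwards record i starts at byte i * pad_width, so the record-oriented
# binary search from the linked question works unchanged
with open('data_padded.csv') as f:
    f.seek(5 * pad_width)            # jump straight to record 5
    print(f.readline().rstrip())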

RanZilber
  • @Mat - not an option right now. I've got a very limited deadline, and not enough time to create a database from that data. – RanZilber Dec 03 '11 at 16:49
  • Do the binary search on byte level and after the seek find the nearest newline. – sleeplessnerd Dec 03 '11 at 16:57
  • http://stackoverflow.com/a/5942463/616486 sqlite seems to have an automatic csv import option (a rough sketch of that route is shown below the comments). – sleeplessnerd Dec 03 '11 at 17:02
  • Nobody's binary search works properly first time. You could have had a database solution up and running by now. How many different large files do you have? What is a typical number of records? What is a typical file size in GB? Is the key a number or a string? – John Machin Dec 03 '11 at 23:23
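
To illustrate the sqlite suggestion from the comments, a minimal sketch of importing the CSV into SQLite and indexing the key could look like this (the file name, table name and two-column layout are assumptions; the real schema depends on your data):

import csv
import sqlite3

conn = sqlite3.connect('records.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (key INTEGER, value TEXT)')

# one-off import of the CSV file
with open('data.csv') as f:
    rows = ((int(row[0]), row[1]) for row in csv.reader(f))
    conn.executemany('INSERT INTO records VALUES (?, ?)', rows)

# an index on the key gives logarithmic lookups, like a binary search
conn.execute('CREATE INDEX IF NOT EXISTS idx_key ON records (key)')
conn.commit()

# after that, each lookup is a single query
print(conn.execute('SELECT value FROM records WHERE key = ?', (42,)).fetchone())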

3 Answers


You don't need fixed-width records, because you don't have to do a record-oriented search. Instead you can do a byte-oriented search and make sure that you realign to a record boundary whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:

fin = open('data.txt', 'rb')       # the sorted "key value" file
file_size = 24935502               # size of the file in bytes
results = {}                       # search key -> value (None if not found)

for search in search_keys:         # search_keys holds the keys to look up
    left, right = 0, file_size - 1
    key = None
    while key != search and left <= right:
        mid = (left + right) // 2
        fin.seek(mid)
        # realign to the start of the next record
        if mid:
            fin.readline()
        key, value = map(int, fin.readline().split())
        if search > key:
            left = mid + 1
        else:
            right = mid - 1
    results[search] = value if key == search else None
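
For example, you could exercise the sketch above on a small sorted file like this (the file name and keys are made up, and os.path.getsize replaces the hard-coded byte count):

import os

# build a tiny sorted "key value" file to search
with open('data.txt', 'w') as f:
    for k in range(0, 1000, 3):
        f.write('%d %d\n' % (k, k * k))

file_size = os.path.getsize('data.txt')   # instead of hard-coding 24935502
search_keys = [0, 300, 301, 999]
# then run the loop above; results should be
# {0: 0, 300: 90000, 301: None, 999: 998001}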
Gabe

You can also solve this with a binary search, but you need to change it a bit:

  1. Get the file size.
  2. Seek to the middle of the file with File.seek.
  3. Search for the first EOL character from there; the line after it is a complete one.
  4. Check that line's key; if it is not the one you want, update the range and go to step 2.

Here is some sample code:

fp = open('your file')                  # the sorted file
fp.seek(0, 2)                           # jump to the end of the file
begin = 0
end = fp.tell()                         # file size in bytes

while begin < end:
    fp.seek((begin + end) // 2, 0)
    fp.readline()                       # skip the partial line we landed in
    line_key = get_key(fp.readline())   # get_key() extracts the key from a line
    if key == line_key:                 # key is the value you are searching for
        break                           # found the line you want
    elif key > line_key:
        begin = fp.tell()
    else:
        end = fp.tell()

The code may have bugs; verify it yourself. And please check the performance if you really need the fastest approach.
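
For the CSV files from the question, get_key could be as simple as this (a sketch assuming the key is the first comma-separated field and is an integer):

def get_key(line):
    # first comma-separated field, parsed as an integer;
    # adjust if your key is a string or sits in another column
    return int(line.split(',', 1)[0])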

Googol

The answer on the referenced question that says binary search only works with fixed-length records is wrong. And you don't need to do a search at all, since you have multiple items to look up. Just walk through the entire file one line at a time, build a dictionary of key:offset for each line, and then for each of your search items jump to the record of interest using os.lseek on the offset corresponding to each key.

Of course, if you don't want to read the entire file even once, you'll have to do a binary search. But if building the index can be amortized over several lookups (perhaps by saving the index to disk if you only do one lookup per day), then a search is unnecessary.
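
A minimal sketch of that index-then-seek approach could look like this (the file name and the assumption that the key is the first CSV field are mine, and it uses the file object's seek/tell rather than os.lseek):

offsets = {}                            # key -> byte offset of its line

with open('data.csv', 'rb') as f:
    # one full pass to record where every record starts
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        key = int(line.split(b',', 1)[0])
        offsets[key] = pos

    # each later lookup is a single seek + readline
    for search in (42, 1000, 31337):    # made-up keys to look up
        if search in offsets:
            f.seek(offsets[search])
            print(f.readline().decode().rstrip())
        else:
            print('%d not found' % search)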

Dave