I tried to open a big .csv file in Python, separate each row, and append the last x lines to new lists.

btcDatear = []
btcPricear = []
btcVolumear = []
howfarback = 20000
try:
    sourceCode = open('.btceUSD.csv', 'r')
    splitSource = sourceCode.split('\n')

    for eachline in splitSource[-howfarback:]:
        splitLine = eachline.split(',')
        btcDate = splitLine[0]
        btcPrice = splitLine[1]
        btcVolume = splitLine[2]

        btcDatear.append(float(btcDate))
        btcPricear.append(float(btcPrice))
        btcVolumear.append(float(btcVolume))


except Exception, e:
    print "failed raw data", str(e)

This works with a smaller file of 20 MB, but this one is 700 MB, so I think there is nothing wrong with the code itself. Is there a better way to make three separate lists of the three columns? I only need the last x rows. Or could I remove the first 200,000 rows so the file is small enough to pass through my code?

Whichever approach is used, it should run in under about 3 minutes, if possible.

henkaap
  • You could make your array x elements large and write into it at the current line number modulo x. – usr1234567 Dec 29 '14 at 15:53
  • Don't try to keep all the data in a list, process it as you read it. And really don't do `sourceCode.split('\n')`, use a `for` loop. – Mark Ransom Dec 29 '14 at 15:53
  • Thanks, how could I do that? I need the last x rows, so don't I have to read the whole file into a list? – henkaap Dec 29 '14 at 15:56
  • Why three separate lists when the contents belong together? – Daniel Dec 29 '14 at 16:07
  • You can find out how many lines in a file using this http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python . Then reopen the file and loop thru all the lines you don't care about before processing your data. Do you know about the csv module in python? – joel goldstick Dec 29 '14 at 16:12
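
The modulo idea from the first comment can be sketched roughly as follows. This is only an illustration of that suggestion, written for Python 3; the function name `last_n_lines_ring` is made up here, and it assumes `n` is at least 1.

```python
def last_n_lines_ring(lines, n):
    """Keep the last n items of an iterable in a fixed-size list.

    Hypothetical helper illustrating the "line number modulo x" idea;
    assumes n >= 1.
    """
    buffer = [None] * n
    count = 0
    for line in lines:
        buffer[count % n] = line  # overwrite the oldest slot
        count += 1
    if count < n:
        # Fewer than n items were seen, so nothing has wrapped around yet.
        return buffer[:count]
    start = count % n  # index of the oldest surviving item
    return buffer[start:] + buffer[:start]
```

Passing an open file object as `lines` reads it one line at a time, so only `n` lines are held in memory at once.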

1 Answer

You can't "split a file", but you can read it line by line no matter how big it is. E.g.:

import collections

btcDatear = []
btcPricear = []
btcVolumear = []
howfarback = 20000
try:
    with open('.btceUSD.csv', 'r') as sourceCode:
        lastNlines = collections.deque(sourceCode, howfarback)
    for eachline in lastNlines:
        splitLine = eachline.split(',')
        btcDate = splitLine[0]
        btcPrice = splitLine[1]
        btcVolume = splitLine[2]

        btcDatear.append(float(btcDate))
        btcPricear.append(float(btcPrice))
        btcVolumear.append(float(btcVolume))
except Exception as e:
    print "failed raw data", str(e)

Building a deque with a maximum length of howfarback is the best way to keep the last N lines of a file that you can only read line by line from the start. The with statement ensures the file is properly closed no matter what; the rest of the logic is like in your code. It would be better to apply the standard library csv module, but, one bit of learning at a time :-).
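
To see the deque behaviour in isolation, here is a tiny sketch (Python 3 syntax). The in-memory `io.StringIO` object is just a stand-in for a real file, and the names `fake_file` and `last_three` are invented for the illustration.

```python
import collections
import io

# Hypothetical stand-in for a large file: ten short lines held in memory.
fake_file = io.StringIO("".join("line %d\n" % i for i in range(10)))

# deque consumes the iterable one line at a time; once it holds maxlen
# items, each new item pushes the oldest one out on the other side.
last_three = collections.deque(fake_file, maxlen=3)

print(list(last_three))  # ['line 7\n', 'line 8\n', 'line 9\n']
```

Note that each kept line still carries its trailing newline; `float()` ignores surrounding whitespace, so that does not get in the way of the conversions above.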

There may be tricks (subtly exploiting the fact that the CSV file is likely to be seekable) to get "the last N lines" faster -- in Unixy systems, the tail system command is very good at that. If the performance of this straightforward approach is too slow for you, ask again and we'll discuss that:-) [and/or how the csv module is best used...]

Added: come to think of it, no need to belabor "tail tricks", as they're well explained at Get last n lines of a file with Python, similar to tail -- the question is by a Python guru, Armin Ronacher, so you can be pretty confident of the quality of his code, and the answers and long discussion are interesting.

So if this simple approach takes too long, study Armin's and his respondents'... very tricky but can be truly useful.
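
For a rough idea of what such a seek-based approach looks like, here is a minimal sketch. It is not Armin's code or any of the answers from that thread, just one simple variant written for Python 3; the name `tail_lines`, the `block_size` default, and the UTF-8 decoding are all assumptions made for the example.

```python
import os


def tail_lines(path, n, block_size=64 * 1024):
    """Return the last n lines of a file as a list of str (no line endings).

    Hypothetical helper: reads fixed-size blocks backwards from the end
    of the file, so the work depends on n rather than on the file size.
    """
    if n <= 0:
        return []
    blocks = []
    newlines_seen = 0
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        # n + 1 newlines guarantee that the n lines we keep are complete,
        # even when the first block read starts in the middle of a line.
        while position > 0 and newlines_seen <= n:
            read_size = min(block_size, position)
            position -= read_size
            f.seek(position)
            block = f.read(read_size)
            newlines_seen += block.count(b'\n')
            blocks.append(block)
    data = b''.join(reversed(blocks))
    # Split while still in bytes, then decode only the lines that are kept.
    return [line.decode('utf-8') for line in data.splitlines()[-n:]]
```

The returned strings can be fed to the same parsing loop as the lines kept in the deque.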

So we might as well focus on the use of the csv module, after an import csv at the start to be sure -- rewriting only the changing part...:

    for fields in csv.reader(iter(lastNlines)):
        btcDate, btcPrice, btcVolume = fields[:3]

all the rest as before. csv.reader takes care of CSV parsing (you may not need the subtleties such as dealing with quoted/escaped commas but you pay no extra there!-) and leaves your code more concise and elegant.
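
Put together, the deque plus csv.reader combination might look something like this. It is a sketch in Python 3 syntax rather than the Python 2 used above; the function name `load_last_rows` is made up, and it assumes every data row is date,price,volume with no header line.

```python
import collections
import csv


def load_last_rows(path, howfarback):
    """Return three lists of floats (dates, prices, volumes) from the last rows.

    Hypothetical helper; assumes each row is date,price,volume, no header.
    """
    dates, prices, volumes = [], [], []
    # newline='' lets the csv module handle line endings itself (Python 3).
    with open(path, 'r', newline='') as source:
        # Reads the whole file once, but keeps only the last lines in memory.
        last_lines = collections.deque(source, maxlen=howfarback)
    for fields in csv.reader(last_lines):
        if len(fields) < 3:
            continue  # assumption: silently skip blank or short rows
        date, price, volume = fields[:3]
        dates.append(float(date))
        prices.append(float(price))
        volumes.append(float(volume))
    return dates, prices, volumes
```

With the file from the question, the call would be `btcDatear, btcPricear, btcVolumear = load_last_rows('.btceUSD.csv', 20000)`.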

Alex Martelli
  • I didn't know `deque` could fill itself automatically like that. And I love the pirate hat. – Mark Ransom Dec 29 '14 at 16:12
  • @MarkRansom , yep, the pirate hat is the coolest-looking of the 11 hats SO awarded me during this "winter bash", so I'm wearing it for the festivities. And, the `maxlen` optional parameter to `deque` was a truly useful addition back in Python 2.6! – Alex Martelli Dec 29 '14 at 16:22
  • Thank you sooooooo much! it's working perfectly. It's great for now and i will look into the csv module. Thank you again and all the other people who posted amazing answers! – henkaap Dec 29 '14 at 16:32
  • Thanks a lot for the complete and kind explanation ;) I think this answer deserves more upvotes! – Mazdak Dec 29 '14 at 16:45