1

Hello, I have a huge CSV file (1 GB) that can be updated (the server often adds new values).

I want to read this file line by line in Python (not load the whole file into memory), and I want to read it in "real time".

This is an example of my CSV file:

id,name,lastname
1,toto,bob
2,tutu,jordan
3,titi,henri

First, I want to get the header of the file (the column names); in my example that is: id,name,lastname

Second, I want to read this file line by line, without loading the whole file into memory.

Third, I want to check for new values every 10 seconds (with sleep(10), for example).

I am currently looking for a solution using pandas; I read this topic: Reading a huge .csv file

import pandas as pd
chunksize = 10 ** 8
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

But I don't understand: 1) I don't know the size of my CSV file, so how do I define chunksize? 2) When I finish reading, how do I tell pandas to check for new values every 10 seconds (for example)?

Thanks in advance for your help.

ilapasle
  • Your question is unclear to me. Can you please define `read new value between 10 seconds`? – jpp Jun 29 '18 at 08:07
  • I have a server that adds data to this CSV file. When the server adds data to the file, I would like my Python program to detect it and display it to me. I think the only solution is to make an infinite loop and, in that loop, check every 10 seconds whether there is new data. 10 seconds is just an example; it could be 1 second, 1 minute... – ilapasle Jun 29 '18 at 08:13

3 Answers

2

First of all, 1 GB is not huge - pretty much any modern device can keep that in its working memory. Second, pandas doesn't let you poke around in the CSV file - you can only tell it how much data to 'load'. I'd suggest using the built-in csv module if you want to do more advanced CSV processing.

Unfortunately, the csv module's reader() produces an exhaustible iterator over your file, so you cannot just build a simple loop around it and wait for the next lines to become available - you'll have to collect the new lines yourself and feed them to a fresh reader to get the effect you want, something like:

import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        reader = csv.reader(f.readlines())  # create a CSV reader for the new lines
        for row in reader:  # iterate over the new rows, if any
            print("Processing new row: {}".format(row))  # process each row however you want
        time.sleep(10)  # wait 10 seconds before attempting again

Beware of the edge cases that may break this process - for example, if you attempt to read new lines while they are still being written, some data might get lost or split (depending on the flushing mechanism used when appending), and if previous lines are deleted the reader might get corrupted, etc. If at all possible, I'd suggest controlling the CSV writing process so that it explicitly informs your processing routines. (One way to guard against a half-written last line is sketched at the end of this answer.)

UPDATE: The above processes the CSV file line by line - the whole file is never loaded into working memory. The only part that holds more than one line in memory is the update step, which picks up all the new lines at once because it's faster to process them that way; unless you're expecting millions of rows of updates between two checks, the memory impact will be negligible. However, if you want that part processed line by line as well, here's how to do it:

import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "rb") as f:  # on Python 3.x use: open(filename, "r", newline="")
    reader = csv.reader(f)  # create a CSV reader
    header = next(reader)  # grab the first line and keep it as a header reference
    print("CSV header: {}".format(header))
    for row in reader:  # iterate over the available rows
        print("Processing row: {}".format(row))  # process each row however you want
    # file exhausted, entering a 'waiting for new data' state where we manually read new lines
    while True:  # process ad infinitum...
        line = f.readline()  # collect the next line, if any available
        if line.strip():  # new line found, we'll ignore empty lines too
            row = next(csv.reader([line]))  # load a line into a reader, parse it immediately
            print("Processing new row: {}".format(row))  # process the row however you want
            continue  # avoid waiting before grabbing the next line
        time.sleep(10)  # wait 10 seconds before attempting again 
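
Regarding the half-written-line edge case mentioned above, here is a rough sketch of one defensive approach (not part of the original answer; it assumes Python 3, plain \n line endings, and a writer that only ever appends): buffer any unterminated data between polls and only parse complete lines:

import csv
import time

filename = "path/to/your/file.csv"

with open(filename, "r", newline="") as f:   # Python 3 text mode
    reader = csv.reader(f)
    header = next(reader)                    # grab the header row
    print("CSV header: {}".format(header))
    for row in reader:                       # existing rows, one at a time
        print("Processing row: {}".format(row))
    buffer = ""                              # holds a possibly incomplete trailing line
    while True:                              # poll for appended data
        buffer += f.read()                   # whatever was appended since the last poll
        parts = buffer.split("\n")
        complete, buffer = parts[:-1], parts[-1]  # keep the unterminated tail for the next poll
        for row in csv.reader(complete):
            if row:                          # skip blank lines
                print("Processing new row: {}".format(row))
        time.sleep(10)                       # wait 10 seconds before checking again
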
zwer
  • But the reader in the `for row in reader:` loop contains all the data of my CSV file - that's not what I want. I want to load one line at a time into memory – ilapasle Jun 29 '18 at 10:41
  • @ilapasle - No, it does not. In fact, the above **is** loading the data line by line instead of keeping the whole file in memory (unlike pandas), apart from the second, `f.readlines()` part (which stores in memory only the update to the file). Check the update above for the second part if it is that crucial to have it line by line even for the CSV file updates. – zwer Jun 29 '18 at 12:14
0

Chunk size is the number of lines pandas reads at once, so it doesn't depend on the file size; at the end of the file the for loop simply ends. The right chunk size depends on the optimal amount of data for your processing step. In some cases 1 GB is not a problem at all, as it can fit in memory, and you don't need chunks. If you aren't OK with 1 GB loaded at once, you can select, for example, one million lines (chunksize = 10 ** 6); with lines about 20 characters long that is something less than 100 MB per chunk, which seems reasonably low, but you may vary the parameter depending on your conditions.
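
For instance, a minimal sketch of chunked reading (the path and process() are placeholders, just like in the snippet quoted in the question):

import pandas as pd

filename = "path/to/your/file.csv"           # placeholder path
chunksize = 10 ** 6                          # one million rows per chunk - tune to your memory budget

for chunk in pd.read_csv(filename, chunksize=chunksize):
    # 'chunk' is an ordinary DataFrame holding at most `chunksize` rows
    process(chunk)                           # placeholder for your own processing
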

When you need to read the updated file, you just start your for loop once again.

If you don't want to read the whole file just to find out that it hasn't changed, you can look at its modification time (details here) and skip reading when it is unchanged (a combined sketch follows the loop below).

If the question is about re-reading every 10 seconds, it can be done in an infinite loop with sleep, like:

import time

while True:
    do_what_you_need()
    time.sleep(10)
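
Combining that loop with the modification-time check mentioned above might look roughly like this (a sketch only; do_what_you_need() is the same placeholder as above and the path is hypothetical):

import os
import time

filename = "path/to/your/file.csv"           # hypothetical path
last_mtime = 0.0                             # modification time of the last version processed

while True:
    mtime = os.path.getmtime(filename)       # cheap stat call, no file contents read
    if mtime != last_mtime:                  # the file changed since the last pass
        last_mtime = mtime
        do_what_you_need()                   # placeholder: re-read / process the file
    time.sleep(10)
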

In fact the period will be more than 10 seconds, as do_what_you_need() also takes time.
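
If a steadier period matters, one small adjustment (a sketch, assuming do_what_you_need() always finishes within the period) is to subtract the processing time from the sleep:

import time

while True:
    start = time.monotonic()
    do_what_you_need()                       # placeholder, as above
    elapsed = time.monotonic() - start
    time.sleep(max(0, 10 - elapsed))         # keep the overall period close to 10 seconds
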

Leonid Mednikov
0

If the question is about reading only the tail of a file, I don't know a good way to do that directly in pandas, but there are some workarounds.

The first idea is to read the file without pandas and remember the last position; next time you need to read, you can use seek. Alternatively, you can combine seek with pandas by reading the new part yourself and passing it through StringIO as the source for pandas.read_csv.
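
A rough sketch of the seek-and-StringIO idea (a sketch only: the path is a placeholder, it assumes the writer appends whole lines, and note that the very first read still pulls everything already in the file into memory):

import io
import time

import pandas as pd

filename = "path/to/your/file.csv"                   # placeholder path

with open(filename, "r") as f:
    header = f.readline().rstrip("\n").split(",")    # column names from the first line
    last_pos = f.tell()                              # remember where the data starts
    while True:
        f.seek(last_pos)                             # jump back to where we stopped
        new_data = f.read()                          # only what was appended since then
        last_pos = f.tell()
        if new_data.strip():
            # hand just the new rows to pandas, reusing the header we saved earlier
            df = pd.read_csv(io.StringIO(new_data), names=header)
            print(df)                                # placeholder: process the new rows
        time.sleep(10)
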

The other workaround is to use the Unix command tail to cut off the last n lines, if you are sure not too many were added at once. It still reads the whole file, but it is much faster than reading and parsing every line with pandas (seek is still theoretically faster on very long files). Here you need to check whether too many lines were added (the last processed id no longer appears in the tail); in that case you'll need a longer tail or to read the whole file.
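
A sketch of that tail workaround (assuming a Unix system, Python 3.7+ for subprocess.run's capture_output, and a file long enough that the header never falls inside the tail window; last_processed_id is a hypothetical bookmark kept between runs):

import io
import subprocess

import pandas as pd

filename = "path/to/your/file.csv"                   # placeholder path
columns = ["id", "name", "lastname"]                 # header from the question

# read only the last 1000 lines - adjust the count to how fast the server writes
tail = subprocess.run(["tail", "-n", "1000", filename],
                      capture_output=True, text=True, check=True)
df = pd.read_csv(io.StringIO(tail.stdout), names=columns)

last_processed_id = 3                                # hypothetical bookmark from the previous pass
new_rows = df[df["id"] > last_processed_id]          # rows we have not seen yet
if len(new_rows) == len(df):
    # every line in the tail is new - some may have been missed, so take a longer tail or re-read the file
    pass
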

All of that involves additional code, logic and potential mistakes. One of them is that the last line could be broken (if you read it at the moment it is being written). So the approach I like most is to switch from a text file to sqlite, an SQL-compatible database that stores its data in a single file and doesn't need a separate server process to access it. It has a Python library (sqlite3) that makes it easy to use. It handles all the issues with a long file, simultaneous writing and reading, and reading only the data you need. Just save the last processed id and make a request like SELECT * FROM table_name WHERE id > last_processed_id;. Of course, this is only possible if you also control the server code and can save the data in this format.
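
A minimal sqlite3 sketch of that idea (the database path, table name and column names are hypothetical - they would have to match whatever the server writes):

import sqlite3
import time

conn = sqlite3.connect("data.db")                    # hypothetical database file written by the server
last_processed_id = 0                                # id of the newest row handled so far

while True:
    cursor = conn.execute(
        "SELECT id, name, lastname FROM people WHERE id > ?",   # 'people' is a hypothetical table
        (last_processed_id,),
    )
    for row_id, name, lastname in cursor:            # rows are fetched lazily, not all at once
        print("Processing new row:", row_id, name, lastname)
        last_processed_id = row_id
    time.sleep(10)                                   # poll again in 10 seconds
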

Leonid Mednikov