
I'm attempting to simulate the use of pandas to access a constantly changing file.

I have one script that reads a csv file, appends a line to it, then sleeps for a random amount of time to simulate bulk input.

import pandas as pd
from time import sleep
import random

df2 = pd.DataFrame(data = [['test','trial']], index=None)

while True:
    df = pd.read_csv('data.csv', header=None)
    df = df.append(df2)  # append returns a new DataFrame; it must be reassigned
    df.to_csv('data.csv', index=False, header=False)
    sleep(random.uniform(0.025,0.3))

The second file is checking for change in data by outputting the shape of the dataframe:

import pandas as pd

while True:
    df = pd.read_csv('data.csv', header=None, names=['Name','DATE'])
    print(df.shape)

The problem is that while I'm getting the correct shape of the DataFrame most of the time, there are certain moments where it outputs (0x2).

i.e.:

...
(10x2)
(10x2)
...
(10x2)
(0x2)
(11x2)
(11x2)
...

This happens at some, but not all, of the changes in shape (i.e., when the first script appends to the file).

Knowing this happens when the first script has the file open to add data and the second script cannot access it (hence the (0x2)), will this cause any data loss?

I cannot access the stream directly, only the output file. Are there any other possible solutions?

Edit

The purpose of this is to load only the new data (I have code that does that) and do analysis "on the fly". Some of the analysis will include output/sec, graphing (similar to a stream plot), and a few other numerical calculations.

The biggest issue is that I have access to the csv file only, and I need to be able to analyze the data as it comes without loss or delay.

Leb
  • What is the goal essentially? Would something like watchdog to check for changes to the file maybe be a better approach? – Padraic Cunningham Oct 06 '15 at 11:27
  • 1
    You could also implement a lock so only one process can open the file at a time, unix has various ways to do it http://stackoverflow.com/questions/29520587/checking-running-python-script-within-the-python-script/29522672#29522672. having one process reading and the other writing probably would not lose you any data but if you are using the data to test for changes you will get incorrect output – Padraic Cunningham Oct 06 '15 at 11:52
  • watchdog seem like an interesting tool to use, but not what I'm looking for. I edited my question to explain more. – Leb Oct 06 '15 at 14:59
  • It seems that the "reader" [your code] is accessing the csv file when the streamer hasn't finished writing the data. It's a race problem. – jprawiharjo Oct 09 '15 at 04:27

2 Answers


One of the scripts is reading the file while the other is trying to write to it, and the two cannot safely access the file at the same time. As Padraic Cunningham says in the comments, you can implement a lock file to solve this problem.

There is a Python package that does exactly that, called lockfile.

Here is your first script with the lockfile package implemented:

import pandas as pd
from time import sleep
import random
from lockfile import FileLock

df2 = pd.DataFrame(data = [['test','trial']], index=None)
lock = FileLock('data.lock')

while True:
    with lock:
        df = pd.read_csv('data.csv', header=None)
        df = df.append(df2)  # append returns a new DataFrame; it must be reassigned
        df.to_csv('data.csv', index=False, header=False)
    sleep(random.uniform(0.025,0.3))

Here is your second script with the lockfile package implemented:

import pandas as pd
from time import sleep
from lockfile import FileLock

lock = FileLock('data.lock')

while True:
    with lock:
        df = pd.read_csv('data.csv', header=None, names=['Name','DATE'])
    print(df.shape)
    sleep(0.100)

I added a wait of 100ms so that I could slow down the output to the console.

These scripts create a file called "data.lock" before accessing "data.csv" and delete "data.lock" afterwards. In either script, if "data.lock" already exists, the script waits until it no longer does.
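Under the hood, a lock file works because creating a file with the exclusive flag is atomic. Here is a minimal sketch of that create/wait/delete protocol, for illustration only (the function name and polling interval are my own; the lockfile package adds retries, timeouts, and stale-lock handling on top of this):

```python
import os
import time
from contextlib import contextmanager

@contextmanager
def simple_lock(path, poll=0.01):
    """Minimal lock-file sketch: O_CREAT | O_EXCL makes the create
    atomic, so only one process can hold the lock at a time."""
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL)
            break                      # we created the file: lock acquired
        except FileExistsError:
            time.sleep(poll)           # another process holds the lock
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)                # releasing = deleting the lock file
```

Both scripts would wrap their "data.csv" access in `with simple_lock('data.lock'):`, which is essentially what `with lock:` does above.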

  • This won't work because the first script is a simulation of the file that's constantly being updated. In reality I can only change/edit the second script where I'm only reading the file – Leb Oct 31 '15 at 01:19

Your simulation script reads and writes to the data.csv file. You can read and write concurrently if one script opens the file as write only and the other opens the file as read only.

With this in mind, I changed your simulation script for writing the file to the following:

from time import sleep
import random

while True:
    with open("data.csv", 'a') as fp:
        fp.write(','.join(['0','1']))
        fp.write('\n')
    sleep(0.010)

In Python, opening a file with 'a' means append, write-only. Using 'a+' appends with read and write access. You must make sure that the code writing the file opens it write-only, and that the script reading the file never attempts to write to it. Otherwise, you will need to implement another solution.

Now you should be able to read using your second script without the issue that you mention.
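Since the question's edit says the goal is to analyze only the newly arrived rows, here is one hedged sketch of a reader for that (the function name and offset-tracking scheme are illustrative, not part of the answer above): remember the byte offset of the last complete line consumed, and on each poll read only what was appended after it. Stopping at the last newline also sidesteps a row the writer is still in the middle of appending:

```python
def read_new_rows(path, offset):
    """Return (lines, new_offset): complete lines appended since `offset`.

    Anything after the last newline may be a half-written row, so it is
    left in place to be picked up on the next poll.
    """
    with open(path, 'rb') as fp:
        fp.seek(offset)
        chunk = fp.read()
    end = chunk.rfind(b'\n') + 1       # 0 if no complete new line yet
    complete = chunk[:end]
    return complete.decode().splitlines(), offset + len(complete)
```

A polling loop would carry the offset along, e.g. `rows, offset = read_new_rows('data.csv', offset)`, feeding `rows` into the per-second counts and plots as they arrive.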