
Background

A driving simulator PC in my lab generates data that I receive via a Python socket. The data arrives every 1/60th of a second, and I continuously save it to a CSV file called position.csv. I also want to read position.csv in R for use in a Shiny app; I read it every 0.2 seconds.

Problem

When I run the shiny app in R, the Python script throws PermissionError: [Errno 13] Permission denied: 'position.csv'

Python script for saving the data to a CSV file:

import socket
import struct
import pandas as pd

UDP_IP = "127.0.0.1"
UDP_PORT = 9000

sock = socket.socket(socket.AF_INET, # Internet
                     socket.SOCK_DGRAM) # UDP
sock.bind((UDP_IP, UDP_PORT))

while True:
    data, addr = sock.recvfrom(1024) # buffer size is 1024 bytes
    fields = struct.unpack_from('=ddd', data)
    print(fields[0],fields[1],fields[2])
    
    
    dict = {'y': fields[0], 'x': fields[1], 'z': fields[2]}
    my_data = pd.DataFrame([dict], columns=dict.keys())
    open("position.csv", "w")
    my_data.to_csv("position.csv", index=False)

R script for reading the CSV file and using it in the app:

library(shinydashboard)
library(dplyr)
library(ggplot2)
library(shiny)
library(data.table)



# ui----
ui <- dashboardPage(skin = "black", 
  dashboardHeader(title = "Dashboard"),
  dashboardSidebar(
    sidebarMenu(
      menuItem("Maps", tabName = "navigation", icon = icon("compass"))
  )),
  dashboardBody(
    tabItems(
      # First tab content
      tabItem(tabName = "navigation",
              fluidRow(
                tags$style(type="text/css", ".recalculating {opacity: 1.0;}"),
                plotOutput("plot1")
              )
      )
    )
  )
)


# server----
server <- function(input, output, session) {
 
  

  position <- reactivePoll(200, session,
                       # This function returns the time that position.csv was last modified
                       checkFunc = function() {
                         if (file.exists("position.csv"))
                           file.info("position.csv")$mtime[1]
                         else
                           ""
                       },
                       # This function returns the content of position.csv
                       valueFunc = function() {
                         data.table::fread("position.csv")
                       }
  )
  

  
  
  xl1 <-  reactive({position()$x - 1000})
  xl2 <-  reactive({position()$x + 1000})
  
  yl1 <-  reactive({position()$y - 800})
  yl2 <-  reactive({position()$y + 800})
  
  
  
 output$plot1 <- renderPlot({
    
    
    ggplot() +
      geom_point(data = position(),
                 aes(x, y),
                 color = "red", size = 5) +
      coord_equal(    xlim = c(xl1(), xl2()),
                      ylim = c(yl1(), yl2())) +
      theme_void()
    
  })
  
  
  
  
  cancel.onSessionEnded <- session$onSessionEnded(function() {

    stopApp()
  })
  
  cancel.onSessionEnded()
  

  
  
}

shinyApp(ui, server)

Question

How do I successfully read from and write to the position.csv file?

umair durrani
  • `shiny` recommendation: combine the four reactives for `xl1`, `xl2`, `yl1`, and `yl2` into a single one. My rationale is that `$plot1` ultimately depends on `position()` ***five times***, and you cannot always guarantee the order of execution, so it's not unreasonable to think that one change to `position()` will cause `ggplot(.)` to fire five times before it is caught up. Better yet, get rid of the four, and either calculate them within the `$plot1` reactive itself, or in the `reactivePoll(.., valueFunc=)`. – r2evans Mar 03 '22 at 18:52
  • Thanks @r2evans. I put them in `$plot1`. – umair durrani Mar 03 '22 at 19:00
  • Another point: get rid of `open(...)` in python. The `pandas.to_csv` function takes care of this for you, and leaving it in may keep file descriptors hanging around (I don't know enough python internals to know its garbage-collection methods well enough). – r2evans Mar 03 '22 at 19:08
  • Thanks for pointing that out. I have commented out `open()`. – umair durrani Mar 03 '22 at 19:11

1 Answer


Without even looking at the shiny portion of this, reaching out to the filesystem for a CSV file every 0.2 seconds is going to be a significant bottleneck and is unlikely to be the best approach with performance in mind.

The most likely reason you're getting "permission denied" (I haven't tested this) is file-locking: pandas has temporarily locked the file while writing to it, and R is trying to read it too soon. Frankly, even if you were not getting a "denied" error, it is entirely possible to read the file while it is mid-write, meaning incomplete data. There should be some coordination of the write and read events so that this cannot happen.

Some thoughts, without testing (but with experience in mutual-file-access):

Not Filesystem-based

One alternative is to use some form of streaming-data mechanism such as Redis. This can be a simple "topic" (a FIFO queue) or, with a little more thought (and depending on your needs), a Pub/Sub setup. With this, the python script would push each new record to the queue or publish it to a channel, and one consumer (plain queue) or one-or-more consumers (pub/sub) would receive the data in its entirety.

Advantages:

  • Not using a filesystem, so the biggest bottleneck will be network bandwidth; latency is likely much lower than with filesystems, and writes and reads are always atomic (meaning no read-while-being-written problems like you're facing);
  • With pub/sub, any client can start "late" and catch up on past data (if the broker retains it) without impacting any other consumers. Realize that "another consumer" may be just you monitoring things; it doesn't necessarily have to be a full-time processing program.

Disadvantages:

  • Requires Redis (or Apache Kafka or RabbitMQ or something similar) as a service somewhere on the network; the closer (topologically), the better.
  • Requires a little more thought about the architecture of cooperation between python and R, though this will reap benefits.

This is actually quite easy to do on your dev-computer using Docker: the Redis image is free and performs very well; I use it regularly for purposes similar to this. (Docker is not required; Redis installs just fine without it, over to you.)

(Python has redis-py, R has redux.)
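
For the Pub/Sub variant mentioned above, a minimal and untested python-side sketch with redis-py might look like the following; the channel name 'carsim' is just illustrative (it mirrors the queue name used further down):

import redis
r = redis.Redis(host='localhost', port=6379, db=0)

# publisher side: send one JSON record per sample
r.publish('carsim', '{"y": 1.0, "x": 2.0, "z": 3.0}')

# subscriber side: block and handle each incoming message
p = r.pubsub()
p.subscribe('carsim')
for message in p.listen():
    if message['type'] == 'message':   # skip the initial subscribe confirmation
        print(message['data'])         # raw bytes payload

The R side would consume via redux in a similar spirit; the plain-queue version shown later in this answer is simpler and is what I'd start with.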

Filesystem-based

If you must go with file-based, then you need to use a method that completely mitigates the risk of reading-while-writing. While file-writing is not atomic (which is why you have problems), file-renaming is. Write the data to a temporary file (on the same filesystem, but not in a place or with a name that R will read) and then, once written and closed, rename it so that R will see it.

For instance, let's assume that your convention is to use /some/path/file1234.csv, where perhaps the 1234 increments with each write. (You might instead use a timestamp; it doesn't matter.) Let's restrict R so that it only sees files that end in the literal .csv (not difficult). In pandas, write to /some/path/file1234.csv.temp, and when complete (and the file is closed in python!), rename it to /some/path/file1234.csv. As soon as the file is renamed, R should be able to read it without interference.

Advantages:

  • No change in architecture, likely the fastest to implement and test.

Disadvantages:

  • Still based on the filesystem, which means compounding latency from: network (if using a network filesystem), OS, HDD, etc.

If you're really curious, Maildir is a directory structure that I use for identical purposes, though it's working off of a huge GPFS (NFS-like) filesystem where the latency of file-creation can be upwards of 10-15 seconds, file-locking is not supported (reliably), and without the file-renaming atomicity I mentioned above, I would be sunk. Surely you do not need the "complexity" (not much, but relatively more complex) of a maildir structure for passing files between pandas and R, but ... the premise of atomic file-renaming has precedent and has had a lot of people work on it. (Maildir scales very well, afaict. The only thing I haven't attempted to figure out yet is filesystem-based pubsub in maildir ...)

(Python and R both do file-renaming atomically, no non-standard modules/packages required.)

Possible change to python (untested):

+ import os

  # ...

      dict = {'y': fields[0], 'x': fields[1], 'z': fields[2]}
      my_data = pd.DataFrame([dict], columns=dict.keys())
-     open("position.csv", "w")
-     my_data.to_csv("position.csv", index=False)
+     my_data.to_csv("position.csv.temp", index=False)
+     try:
+       os.remove("position.csv")
+     except:
+       pass
+     os.rename("position.csv.temp", "position.csv")

It might be good enough to os.remove("position.csv") before renaming (as above) rather than renaming the old file aside to something like position.csv.old; I haven't tested to see what works best. I'm not as concerned about reading processes: on most systems, a file that R already has open stays readable (the inode survives) even if the filename has been deleted. Again, not well tested.
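
A related option (standard library, untested in your setup): os.replace performs the rename and overwrites an existing destination in a single call, which sidesteps the remove-then-rename window and the FileExistsError that plain os.rename raises on Windows when the target already exists:

    my_data.to_csv("position.csv.temp", index=False)
    # os.replace (Python 3.3+) overwrites position.csv in one call
    os.replace("position.csv.temp", "position.csv")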

Other Considerations

  • File format: while CSV is standard and easy, you may want to consider something that reads and writes faster, such as feather. Both python and R have modules/packages that support this (a tiny sketch follows this list). I don't have experience with it, but perhaps https://rstudio-pubs-static.s3.amazonaws.com/207316_edcc0ea0a7c04ea5a63833aaea7051fb.html is a useful start. (There's also "parquet"; I have no experience with this either.)

  • Down-sampling: do you really need the data to be read every 0.2 seconds? Shiny (R!) is not going to guarantee reading every file right away, so there will be some "slop" here anyway. I recommend down-sampling to perhaps 1/sec or once per 2-3 seconds, depending on how you really intend to use it. I recognize some use-cases are not compatible with this recommendation, over to you.
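
A rough, untested sketch of the feather idea (assuming pyarrow is installed on the python side and the arrow package on the R side; the same rename trick still applies, since a feather file can also be caught mid-write):

# python side: write feather, then rename into place
my_data.to_feather("position.feather.temp")
os.replace("position.feather.temp", "position.feather")

# R side: e.g. inside reactivePoll's valueFunc
dat <- arrow::read_feather("position.feather")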


Redis option:

Redis is running on my laptop using the default port 6379. I'm using Docker, so I started it with the command below. (Note that if you have a special networking setup and/or are not able to use host-mode for listening ports, this might need adjusting. It should just work, though; no other configuration required.)

$ docker run -p "6379:6379" --name some-redis -d redis
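
To sanity-check that the container is up and answering (optional):

$ docker exec -it some-redis redis-cli ping
PONG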

In python:

import pandas as pd
import redis
r = redis.Redis(host='localhost', port=6379, db=0)
df1 = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame(data={'col1': [11, 12], 'col2': [13, 14]})
r.rpush('carsim', df1.to_json(orient='records'))
r.rpush('carsim', df2.to_json(orient='records'))
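
To tie that into your existing receive loop (a sketch, untested; it reuses the socket/struct setup from your question and a local Redis server), each sample could be pushed as one JSON record instead of being written to position.csv:

import socket
import struct
import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 9000))

while True:
    data, addr = sock.recvfrom(1024)
    y, x, z = struct.unpack_from('=ddd', data)
    # one single-row JSON array per sample; R pops and parses it with jsonlite::fromJSON
    r.rpush('carsim', json.dumps([{'y': y, 'x': x, 'z': z}]))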

In R:

R <- redux::hiredis()
popped <- R$LPOP("carsim")
popped
# [1] "[{\"col1\":1,\"col2\":3},{\"col1\":2,\"col2\":4}]"
jsonlite::fromJSON(popped)
#   col1 col2
# 1    1    3
# 2    2    4
popped <- R$LPOP("carsim")
jsonlite::fromJSON(popped)
#   col1 col2
# 1   11   13
# 2   12   14
popped <- R$LPOP("carsim")
popped
# NULL
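
And on the shiny side (again a sketch, untested), the reactivePoll in your server function could drain that queue instead of re-reading position.csv:

R <- redux::hiredis()

position <- reactivePoll(200, session,
  # the queue length changes whenever the simulator pushes new samples
  checkFunc = function() R$LLEN("carsim"),
  valueFunc = function() {
    out <- list()
    # LPOP returns NULL once the queue is empty
    while (!is.null(popped <- R$LPOP("carsim"))) {
      out[[length(out) + 1L]] <- jsonlite::fromJSON(popped)
    }
    if (length(out)) data.table::rbindlist(out) else NULL
  }
)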

Adaptation of maildir-like directories

This trick uses a premise of maildirs with a little bit of compromise (which should be fine so long as there is only one file-writer, the car-sim).

In python (this also needs import time alongside the earlier import os):

-     my_data.to_csv("position.csv", index=False)
+     filename = '{:.3f}.csv'.format(time.time())
+     my_data.to_csv('tmp/' + filename, index=False)
+     os.rename('tmp/' + filename, 'new/' + filename)

And in R:

files <- list.files("new/", full.names = TRUE)
dat <- rbindlist(lapply(files, fread))
# file.rename needs full destination paths, not a bare directory
file.rename(files, file.path("cur", basename(files)))
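
One small assumption in the sketch above: the tmp/, new/, and cur/ directories must already exist (and live on the same filesystem, so the rename stays atomic). They can be created up front, e.g. on the python side:

import os
for d in ("tmp", "new", "cur"):
    os.makedirs(d, exist_ok=True)   # no-op if the directory already exists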
r2evans
  • Thank you for a detailed answer. I want to point out that the intervals for reading and writing are different. Writing: every 1/60th of a second. Reading: every 0.2 s (I can increase it to 1 sec). Secondly, I have also tested with `sqlite3` database. There is no issue with read/write but I occasionally get the error `database is locked` in R, which I assume is due to rapid writing from python. Do you think `sqlite3` will always be faster than a file-based workflow? – umair durrani Mar 03 '22 at 18:57
  • I tried using a SQLite database for the same reason. Unfortunately, while it is generally fine for multiple *read* connections, any *write* operation locks the entire file, blocking any reads (refs: https://www.sqlite.org/faq.html#q5 and https://stackoverflow.com/q/10325683/3358272). Really, if you want to consider SQL, go with something that truly supports simultaneous write/read operations ... PostgreSQL, MariaDB, SQL Server all come to mind. I do this (daily!) as well, and it can work, but not with sqlite unfortunately. – r2evans Mar 03 '22 at 19:00
  • Thanks a lot for all your help @r2evans. I have no background in databases so it is going to be tough :) But I will look into Redis and your other recommendations. – umair durrani Mar 03 '22 at 19:04
  • Ultimately, your issue is concurrent file access, and ... with my research (and I've been doing this *exact* thing for a few years now in a highly-parallel environment), if you are truly filesystem-bound, then the only way you can "guarantee" safe concurrent file access is to use atomic file-renames, and a clear contract: the only process that is allowed to read a `.csv.temp` file is the process that created it (which it should need to do anyway, so ... never read `.temp` files); and nobody writes to a `.csv` file, *ever*. – r2evans Mar 03 '22 at 19:04
  • I think your least-expensive path "today" is to do the file-renaming trick: change to `my_data.to_csv("position.csv.temp", index=False)` then follow that with `os.rename("position.csv.temp", "position.csv")` and see if you get "denied" errors. See my recent edit with suggested python code. – r2evans Mar 03 '22 at 19:12
  • Another docker hint: starting up postgres or mssql or similar in a docker instance is *too easy* to not consider using it in your lab. You don't need to advertise it to the world for it to be sufficient for your needs; you may not even care to use a data-volume for table persistence, over to you on that one. – r2evans Mar 03 '22 at 19:15
  • The more I think of it ... a DBMS (other than sqlite) adds some DBA administrivia that may not be so trivial: if you continually push new data to a table, then the reading side either (a) reads all data, which seems grossly inefficient; or (b) reads data more recent than the last read. The second part requires a timestamp field (e.g., `Created`), but as data grows you will *need* good indexing on the table. If it is *just* adding an index on `Created`, that may be fine, but if you intend data-persistence then you will likely need more. – r2evans Mar 03 '22 at 19:20
  • The file renaming trick worked for a few seconds. But then `FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'position.csv.temp' -> 'position.csv'`. – umair durrani Mar 03 '22 at 19:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242577/discussion-between-r2evans-and-umair-durrani). – r2evans Mar 03 '22 at 20:00
  • @umairdurrani, see my edit for redis on both sides, python and R. In this case, the redis server is on my laptop where I'm running R and python. Good luck! – r2evans Mar 04 '22 at 01:27
  • And one more edit showing a file-based approach, but with better management that shouldn't result in the `FileExistsError` problem. – r2evans Mar 04 '22 at 12:05
  • Quick update: Thanks for the file-based solution. It throws an error with `file.rename(files, "cur/")`. But I also tried by commenting out this line. That worked but the app was slower than before. I am now reading about Redis to try out your solution. Alternatively, I am looking into `streamlit` for app dev in python to remove this step of writing and reading the data. Thanks for all your help. – umair durrani Mar 04 '22 at 15:49
  • Thank you very much! The Redis database worked for me. I no longer get the database-locked error. The app runs smoothly. Quick question: after I close the shiny app, how can I delete the redis database? Would a simple restart of redis be sufficient? – umair durrani Mar 17 '22 at 16:52
  • Unless specifically configured otherwise, Redis used in this instance is memory-only and therefore non-persistent. This can be both a strength and a weakness, depending on your needs, and is a constant topic for discussion when comparing Redis with other key/value, nosql, non-relational, queueing, pubsub, and many other data-structure streaming/non-streaming relays (did I hit the buzzword bingo there? I tried). Further, if you are just using basic queues in Redis, there is no "go back in time" ... once popped, it is gone, so effectively this is self-clearing. – r2evans Mar 17 '22 at 21:48
  • Just to be safe, I also run the flushdb command after closing the shiny app. Thank you again for your time and guidance. – umair durrani Mar 17 '22 at 21:59