
Background

A driving simulator PC in my lab generates data that I receive via a Python socket. The data arrives every 1/60th of a second, and I continuously save it to a CSV file called position.csv. I also want to read position.csv in R for use in a Shiny app; I read it every 0.2 seconds.

Problem

When I run the shiny app in R, the Python script throws PermissionError: [Errno 13] Permission denied: 'position.csv'

Python script for saving the data to a CSV file:

import socket
import struct
import pandas as pd

UDP_IP = "127.0.0.1"
UDP_PORT = 9000

sock = socket.socket(socket.AF_INET, # Internet
                     socket.SOCK_DGRAM) # UDP
sock.bind((UDP_IP, UDP_PORT))

while True:
    data, addr = sock.recvfrom(1024) # buffer size is 1024 bytes
    fields = struct.unpack_from('=ddd', data)
    print(fields[0],fields[1],fields[2])
    
    
    dict = {'y': fields[0], 'x': fields[1], 'z': fields[2]}
    my_data = pd.DataFrame([dict], columns=dict.keys())
    open("position.csv", "w")
    my_data.to_csv("position.csv", index=False)

R script for reading the CSV file and using it in the app:

library(shinydashboard)
library(dplyr)
library(ggplot2)
library(shiny)
library(data.table)



# ui----
ui <- dashboardPage(skin = "black", 
  dashboardHeader(title = "Dashboard"),
  dashboardSidebar(
    sidebarMenu(
      menuItem("Maps", tabName = "navigation", icon = icon("compass"))
  )),
  dashboardBody(
    tabItems(
      # First tab content
      tabItem(tabName = "navigation",
              fluidRow(
                tags$style(type="text/css", ".recalculating {opacity: 1.0;}"),
                plotOutput("plot1")
              )
      )
    )
  )
)


# server----
server <- function(input, output, session) {
 
  

  position <- reactivePoll(200, session,
                       # This function returns the time that position.csv was last modified
                       checkFunc = function() {
                         if (file.exists("position.csv"))
                           file.info("position.csv")$mtime[1]
                         else
                           ""
                       },
                       # This function returns the content of position.csv
                       valueFunc = function() {
                         data.table::fread("position.csv")
                       }
  )
  

  
  
  xl1 <-  reactive({position()$x - 1000})
  xl2 <-  reactive({position()$x + 1000})
  
  yl1 <-  reactive({position()$y - 800})
  yl2 <-  reactive({position()$y + 800})
  
  
  
 output$plot1 <- renderPlot({
    
    
    ggplot() +
      geom_point(data = position(),
                 aes(x, y),
                 color = "red", size = 5) +
      coord_equal(    xlim = c(xl1(), xl2()),
                      ylim = c(yl1(), yl2())) +
      theme_void()
    
  })
  
  
  
  
  cancel.onSessionEnded <- session$onSessionEnded(function() {

    stopApp()
  })
  
  cancel.onSessionEnded()
  

  
  
}

shinyApp(ui, server)

Question

How do I successfully read from and write to the position.csv file?

umair durrani
  • `shiny` recommendation: combine the four reactives for `xl1`, `xl2`, `yl1`, and `yl2` into a single one. My rationale is that `$plot1` ultimately depends on `position()` ***five times***, and you cannot always guarantee the order of execution, so it's not unreasonable to think that one change to `position()` will cause `ggplot(.)` to fire five times before it is caught up. Better yet, get rid of the four, and either calculate them within the `$plot1` reactive itself, or in the `reactivePoll(.., valueFunc=)`. – r2evans Mar 03 '22 at 18:52
  • Thanks @r2evans. I put them in `$plot1`. – umair durrani Mar 03 '22 at 19:00
  • Another point: get rid of `open(...)` in python. The `pandas.to_csv` function takes care of this for you, and leaving it in may keep file descriptors hanging around (I don't know enough python internals to know its garbage-collection methods well enough). – r2evans Mar 03 '22 at 19:08
  • Thanks for pointing that out. I have commented out `open()`. – umair durrani Mar 03 '22 at 19:11

1 Answer


Without even looking at the shiny portion of this, reaching out to the filesystem for a CSV file every 0.2 seconds is going to be a significant bottleneck and is unlikely to be the best approach with performance in mind.

The most likely reason you're getting "permission denied" (I haven't tested this) is file-locking: pandas has temporarily locked the file while writing to it, and R is trying to read it too soon. Frankly, even if you were not getting a "denied" error, it is entirely possible to read the file while it is mid-write, meaning incomplete data. There should be some coordination of the write and read events so that this cannot happen.

Some thoughts, without testing (but with experience in mutual-file-access):

Not Filesystem-based

One alternative is to use some form of streaming-data mechanism such as Redis. This can be a simple "topic" (a FIFO queue) or, with a little more thought (and depending on your needs), a Pub/Sub setup. With this, the python script would push each new record to the queue or publish it to a channel, and one consumer (plain queue) or one-or-more consumers (pub/sub) would receive the data in its entirety.

Advantages:

  • Not using a filesystem, so the biggest bottleneck will be network bandwidth; latency is likely much lower than with filesystems, and writes and reads are always atomic (meaning no read-while-being-written problems like you're facing);
  • With pub/sub, any client can start "late" and catch up on past data (if the broker retains it) without impacting any other consumers. Realize that "another consumer" may be just you monitoring things; it doesn't necessarily have to be a full-time processing program.

Disadvantages:

  • Requires Redis (or Apache Kafka or RabbitMQ or something similar) as a service somewhere on the network; the closer (topologically), the better.
  • Requires a little more thought about the architecture of cooperation between python and R, though this will reap benefits.

This is actually quite easy to do on your dev-computer using Docker: the Redis image is free and performs very well; I use it regularly for purposes similar to this. (Docker is not required; Redis installs just fine without it, over to you.)

(Python has redis-py, R has redux.)
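
For the Pub/Sub variant mentioned above, a minimal and untested python-side sketch with redis-py might look like the following; the channel name 'carsim' is just illustrative (it mirrors the queue name used further down):

import redis
r = redis.Redis(host='localhost', port=6379, db=0)

# publisher side: send one JSON record per sample
r.publish('carsim', '{"y": 1.0, "x": 2.0, "z": 3.0}')

# subscriber side: block and handle each incoming message
p = r.pubsub()
p.subscribe('carsim')
for message in p.listen():
    if message['type'] == 'message':   # skip the initial subscribe confirmation
        print(message['data'])         # raw bytes payload

The R side would consume via redux in a similar spirit; the plain-queue version shown later in this answer is simpler and is what I'd start with.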

Filesystem-based

If you must go with file-based, then you need to use a method that completely mitigates the risk of reading-while-writing. While file-writing is not atomic (which is why you have problems), file-renaming is. Write the data to a temporary file (on the same filesystem, but not in a place or with a name that R will read) and then, once written and closed, rename it so that R will see it.

For instance, let's assume that your convention is to use /some/path/file1234.csv, where perhaps the 1234 increments with each write. (You might instead use a timestamp; it doesn't matter.) Let's restrict R so that it only sees files that end in the literal .csv (not difficult). In pandas, write to /some/path/file1234.csv.temp, and when complete (and the file is closed in python!), rename it to /some/path/file1234.csv. As soon as the file is renamed, R should be able to read it without interference.

Advantages:

  • No change in architecture, likely the fastest to implement and test.

Disadvantages:

  • Still based on the filesystem, which means compounding latency from: network (if using a network filesystem), OS, HDD, etc.

If you're really curious, Maildir is a directory structure that I use for identical purposes, though it's working off of a huge GPFS (NFS-like) filesystem where the latency of file-creation can be upwards of 10-15 seconds, file-locking is not supported (reliably), and without the file-renaming atomicity I mentioned above, I would be sunk. Surely you do not need the "complexity" (not much, but relatively more complex) of a maildir structure for passing files between pandas and R, but ... the premise of atomic file-renaming has precedent and has had a lot of people work on it. (Maildir scales very well, afaict. The only thing I haven't attempted to figure out yet is filesystem-based pubsub in maildir ...)

(Python and R both do file-renaming atomically, no non-standard modules/packages required.)

Possible change to python (untested):

+ import os

  # ...

      dict = {'y': fields[0], 'x': fields[1], 'z': fields[2]}
      my_data = pd.DataFrame([dict], columns=dict.keys())
-     open("position.csv", "w")
-     my_data.to_csv("position.csv", index=False)
+     my_data.to_csv("position.csv.temp", index=False)
+     try:
+       os.remove("position.csv")
+     except:
+       pass
+     os.rename("position.csv.temp", "position.csv")

It might be good enough to os.remove("position.csv") before renaming (as above) rather than renaming the old file aside to something like position.csv.old; I haven't tested to see what works best. I'm not as concerned about reading processes: on most systems, a file that R already has open stays readable (the inode survives) even if the filename has been deleted. Again, not well tested.
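
A related option (standard library, untested in your setup): os.replace performs the rename and overwrites an existing destination in a single call, which sidesteps the remove-then-rename window and the FileExistsError that plain os.rename raises on Windows when the target already exists:

    my_data.to_csv("position.csv.temp", index=False)
    # os.replace (Python 3.3+) overwrites position.csv in one call
    os.replace("position.csv.temp", "position.csv")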

Other Considerations

  • File format: while CSV is standard and easy, you may want to consider something that reads and writes faster, such as feather. Both python and R have modules/packages that support this (a tiny sketch follows this list). I don't have experience with it, but perhaps https://rstudio-pubs-static.s3.amazonaws.com/207316_edcc0ea0a7c04ea5a63833aaea7051fb.html is a useful start. (There's also "parquet"; I have no experience with this either.)

  • Down-sampling: do you really need the data to be read every 0.2 seconds? Shiny (R!) is not going to guarantee reading every file right away, so there will be some "slop" here anyway. I recommend down-sampling to perhaps 1/sec or once per 2-3 seconds, depending on how you really intend to use it. I recognize some use-cases are not compatible with this recommendation, over to you.
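
A rough, untested sketch of the feather idea (assuming pyarrow is installed on the python side and the arrow package on the R side; the same rename trick still applies, since a feather file can also be caught mid-write):

# python side: write feather, then rename into place
my_data.to_feather("position.feather.temp")
os.replace("position.feather.temp", "position.feather")

# R side: e.g. inside reactivePoll's valueFunc
dat <- arrow::read_feather("position.feather")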


Redis option:

Redis is running on my laptop using the default port 6379. I'm using Docker, so I started it with the command below. (Note that if you have a special networking setup and/or are not able to use host-mode for listening ports, this might need adjusting. It should just work, though; no other configuration required.)

$ docker run -p "6379:6379" --name some-redis -d redis
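
To sanity-check that the container is up and answering (optional):

$ docker exec -it some-redis redis-cli ping
PONG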

In python:

import pandas as pd
import redis
r = redis.Redis(host='localhost', port=6379, db=0)
df1 = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame(data={'col1': [11, 12], 'col2': [13, 14]})
r.rpush('carsim', df1.to_json(orient='records'))
r.rpush('carsim', df2.to_json(orient='records'))
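
To tie that into your existing receive loop (a sketch, untested; it reuses the socket/struct setup from your question and a local Redis server), each sample could be pushed as one JSON record instead of being written to position.csv:

import socket
import struct
import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 9000))

while True:
    data, addr = sock.recvfrom(1024)
    y, x, z = struct.unpack_from('=ddd', data)
    # one single-row JSON array per sample; R pops and parses it with jsonlite::fromJSON
    r.rpush('carsim', json.dumps([{'y': y, 'x': x, 'z': z}]))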

In R:

R <- redux::hiredis()
popped <- R$LPOP("carsim")
popped
# [1] "[{\"col1\":1,\"col2\":3},{\"col1\":2,\"col2\":4}]"
jsonlite::fromJSON(popped)
#   col1 col2
# 1    1    3
# 2    2    4
popped <- R$LPOP("carsim")
jsonlite::fromJSON(popped)
#   col1 col2
# 1   11   13
# 2   12   14
popped <- R$LPOP("carsim")
popped
# NULL
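
And on the shiny side (again a sketch, untested), the reactivePoll in your server function could drain that queue instead of re-reading position.csv:

R <- redux::hiredis()

position <- reactivePoll(200, session,
  # the queue length changes whenever the simulator pushes new samples
  checkFunc = function() R$LLEN("carsim"),
  valueFunc = function() {
    out <- list()
    # LPOP returns NULL once the queue is empty
    while (!is.null(popped <- R$LPOP("carsim"))) {
      out[[length(out) + 1L]] <- jsonlite::fromJSON(popped)
    }
    if (length(out)) data.table::rbindlist(out) else NULL
  }
)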

Adaptation of maildir-like directories

This trick uses a premise of maildirs with a little bit of compromise (which should be fine so long as there is only one file-writer, the car-sim).

In python (this also needs import time alongside the earlier import os):

-     my_data.to_csv("position.csv", index=False)
+     filename = '{:.3f}.csv'.format(time.time())
+     my_data.to_csv('tmp/' + filename, index=False)
+     os.rename('tmp/' + filename, 'new/' + filename)

And in R:

files <- list.files("new/", full.names = TRUE)
dat <- rbindlist(lapply(files, fread))
# file.rename needs full destination paths, not a bare directory
file.rename(files, file.path("cur", basename(files)))
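
One small assumption in the sketch above: the tmp/, new/, and cur/ directories must already exist (and live on the same filesystem, so the rename stays atomic). They can be created up front, e.g. on the python side:

import os
for d in ("tmp", "new", "cur"):
    os.makedirs(d, exist_ok=True)   # no-op if the directory already exists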
r2evans
  • Thank you for a detailed answer. I want to point out that the intervals for reading and writing are different. Writing: every 1/60th of a second. Reading: every 0.2 s (I can increase it to 1 sec). Secondly, I have also tested with `sqlite3` database. There is no issue with read/write but I occasionally get the error `database is locked` in R, which I assume is due to rapid writing from python. Do you think `sqlite3` will always be faster than a file-based workflow? – umair durrani Mar 03 '22 at 18:57
  • I tried using a SQLite database for the same reason. Unfortunately, while it is generally fine for multiple *read* connections, any *write* operation locks the entire file, blocking any reads (refs: https://www.sqlite.org/faq.html#q5 and https://stackoverflow.com/q/10325683/3358272). Really, if you want to consider SQL, go with something that truly supports simultaneous write/read operations ... PostgreSQL, MariaDB, SQL Server all come to mind. I do this (daily!) as well, and it can work, but not with sqlite unfortunately. – r2evans Mar 03 '22 at 19:00
  • Thanks a lot for all your help @r2evans. I have no background in databases so it is going to be tough :) But I will look into Redis and your other recommendations. – umair durrani Mar 03 '22 at 19:04
  • Ultimately, your issue is concurrent file access, and ... with my research (and I've been doing this *exact* thing for a few years now in a highly-parallel environment), if you are truly filesystem-bound, then the only way you can "guarantee" safe concurrent file access is to use atomic file-renames, and a clear contract: the only process that is allowed to read a `.csv.temp` file is the process that created it (which it should need to do anyway, so ... never read `.temp` files); and nobody writes to a `.csv` file, *ever*. – r2evans Mar 03 '22 at 19:04
  • I think your least-expensive path "today" is to do the file-renaming trick: change to `my_data.to_csv("position.csv.temp", index=False)` then follow that with `os.rename("position.csv.temp", "position.csv")` and see if you get "denied" errors. See my recent edit with suggested python code. – r2evans Mar 03 '22 at 19:12
  • Another docker hint: starting up postgres or mssql or similar in a docker instance is *too easy* to not consider using it in your lab. You don't need to advertise it to the world for it to be sufficient for your needs; you may not even care to use a data-volume for table persistence, over to you on that one. – r2evans Mar 03 '22 at 19:15
  • The more I think of it ... a DBMS (other than sqlite) adds some DBA administrivia that may not be so trivial: if you continually push new data to a table, then the reading side either (a) reads all data, which seems grossly inefficient; or (b) reads data more recent than the last read. The second part requires a timestamp field (e.g., `Created`), but as data grows you will *need* good indexing on the table. If it is *just* adding an index on `Created`, that may be fine, but if you intend data-persistence then you will likely need more. – r2evans Mar 03 '22 at 19:20
  • The file renaming trick worked for a few seconds. But then `FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'position.csv.temp' -> 'position.csv'`. – umair durrani Mar 03 '22 at 19:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242577/discussion-between-r2evans-and-umair-durrani). – r2evans Mar 03 '22 at 20:00
  • @umairdurrani, see my edit for redis on both sides, python and R. In this case, the redis server is on my laptop where I'm running R and python. Good luck! – r2evans Mar 04 '22 at 01:27
  • And one more edit showing a file-based approach, but with better management that shouldn't result in the `FileExistsError` problem. – r2evans Mar 04 '22 at 12:05
  • Quick update: Thanks for the file-based solution. It throws an error with `file.rename(files, "cur/")`. But I also tried by commenting out this line. That worked but the app was slower than before. I am now reading about Redis to try out your solution. Alternatively, I am looking into `streamlit` for app dev in python to remove this step of writing and reading the data. Thanks for all your help. – umair durrani Mar 04 '22 at 15:49
  • Thank you very much! The Redis database worked for me. I no longer get the database-locked error. The app runs smoothly. Quick question: after I close the shiny app, how can I delete the redis database? Would a simple restart of redis be sufficient? – umair durrani Mar 17 '22 at 16:52
  • Unless specifically configured otherwise, Redis used in this instance is memory-only and therefore non-persistent. This can be both a strength and a weakness, depending on your needs, and is a constant topic for discussion when comparing Redis with other key/value, nosql, non-relational, queueing, pubsub, and many other data-structure streaming/non-streaming relays (did I hit the buzzword bingo there? I tried). Further, if you are just using basic queues in Redis, there is no "go back in time" ... once popped, it is gone, so effectively this is self-clearing. – r2evans Mar 17 '22 at 21:48
  • Just to be safe, I also run the flushdb command after closing the shiny app. Thank you again for your time and guidance. – umair durrani Mar 17 '22 at 21:59