
I have a Python (3.6) script which reads data from a CSV file into a pandas dataframe; pandas then performs actions for each new line read from the CSV file...

This works fine for a static CSV file, e.g. one where all the data to be processed is already contained within the CSV file...

I would like to be able to append to the CSV file from another Python process so that data can be continuously fed into the pandas dataframe, or if the process that feeds the data to pandas reaches the end of the file, it waits for a new row to be appended to the CSV file and then continues reading rows into pandas...

Is this possible?

I am new to pandas and, at the moment, I am having difficulty understanding how pandas can be used with real-time/dynamic data, as all the examples I see seem to use static CSV files as a data source.

Ideally, I would like to be able to feed rows into pandas from a message queue directly, but I don't think this is possible - so I was thinking that if I have a second Python script that receives a message from a queue and then appends it as a new row to the CSV file, the original script could read it into pandas...

Am I misunderstanding how pandas works or can you give any pointers on if/how I can get this sort of thing to work?

Mark Smith
    If you control the process that can potentially append data to the CSV, why not instead have that process communicate the new data to the other process, via a web service or something? Say Process A is the one which uses pandas, reads a cache of data, and reacts. Process B can communicate with process A either by appending to that shared CSV, or by writing to another location which A knows to check, or letting A make direct requests for B to provide the data. – ely Jan 25 '18 at 18:55
  • Thanks for your comment - I'm new to pandas/python so will need to spend a little time to digest this to fully understand it and see if/how I can implement it... – Mark Smith Jan 25 '18 at 18:58

2 Answers


You can pop comma separated values off a queue and wrap them in a dataframe.

You can then take that in-memory tiny dataframe and append it to whatever other dataframe you want, that's also in memory. You can also write it out to a file with .to_csv('whatever', mode='a').

It would be preferable not to write to CSV in the first place and to leave the data as an array of strings, but since this more directly answers your question:

import pandas as pd

big_df = pd.read_csv('file.csv')

def handle_csv(csv_lines):
    # Wrap the incoming comma-separated lines in a small dataframe
    mini_df = pd.DataFrame([sub.split(",") for sub in csv_lines])
    # append() returns a new dataframe rather than mutating in place,
    # so reassign the result to keep the growing dataframe
    global big_df
    big_df = big_df.append(mini_df)
    # Also persist the new rows by appending them to a csv file
    mini_df.to_csv("somefile", mode='a', header=False, index=False)
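To tie this together, here is a minimal sketch of draining a standard-library queue and growing the dataframe in memory. It uses `pd.concat` (the modern replacement for `append`); the queue object, column names, and timeout are assumptions for illustration, not from the original answer:

```python
import queue
import pandas as pd

# In-memory dataframe that grows as messages arrive (columns are assumed)
big_df = pd.DataFrame(columns=["a", "b", "c"])

def drain_queue(q, timeout=1.0):
    """Pop comma-separated strings off the queue and append them to big_df."""
    global big_df
    rows = []
    try:
        while True:
            # Block up to `timeout` seconds for each message
            rows.append(q.get(timeout=timeout).split(","))
    except queue.Empty:
        pass
    if rows:
        mini_df = pd.DataFrame(rows, columns=big_df.columns)
        big_df = pd.concat([big_df, mini_df], ignore_index=True)

# Hypothetical usage: a producer process would put lines on this queue
q = queue.Queue()
q.put("1,2,3")
q.put("4,5,6")
drain_queue(q, timeout=0.1)
print(len(big_df))  # two rows appended
```

In a real setup the producer would be the process receiving messages, and `drain_queue` would run in a loop in the pandas process.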
  • Thank you Hannah, do you have any links or more info - I am new to python/pandas so am a little unclear on what you have suggested so any more detail you can add will be much appreciated - thanks – Mark Smith Jan 25 '18 at 20:53
    Hmm. So.. You basically want to define your schema for your dataframe - what are the columns - when you construct your base dataframe (big_df above). See doc [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). My code snippet is describing reading values off a queue, and appending them to a growing dataframe in memory. Is that what you had in mind? – Hannah Lindsley Jan 25 '18 at 21:02
  • Thanks for getting back - I'm running pattern analysis on the data and want to be able to start with loading historic data and then update it as new data points (rows) are received from a message queue - so that pandas can then run the pattern analysis again taking the new data into account... If I understand your code above, I think I could load the existing csv file as big_df and then append data from a different csv file loaded as mini_df, then clear out data from the new csv file and loop the process as new data is received... sorry for my newbiness and thanks again for your help! – Mark Smith Jan 25 '18 at 21:26
  • Sure, that sounds doable with this. I'll update the comment to reflect more precisely what you're doing. – Hannah Lindsley Jan 25 '18 at 21:34
    Unless you *need* to append data from a different csv, it'd be easier to do it directly from the queue, omitting the step of writing it out to a csv and then reading it back in. FWIW. – Hannah Lindsley Jan 25 '18 at 21:39
  • Thank you for your help, I'm going to see what I can do - it may take a few hours as I am very new to pandas - but I think you have helped point me in the right direction and I will accept your answer as correct if so! – Mark Smith Jan 25 '18 at 21:56

You could try using pandas' read_csv() function to read the big CSV file in small chunks; the basic code is written below:

import pandas as pd
chunksize = 100
for chunk in pd.read_csv('myfile.csv', chunksize=chunksize):
    print(chunk)

See here for more: http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking

... Although I'm not completely sure how this will interact with a non-static file, or whether this would be the best solution... keeping the read position far enough away from the end of the file could be one approach.
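Along the lines of the tailing approach mentioned in the comments, a sketch of a generator that waits at end-of-file and yields complete new lines as they are appended (the filename and poll interval are assumptions):

```python
import time

def follow(path, poll_interval=0.5):
    """Yield complete new lines appended to the file, waiting at EOF."""
    with open(path) as f:
        while True:
            where = f.tell()
            line = f.readline()
            if line.endswith("\n"):
                yield line.rstrip("\n")
            else:
                # No complete new line yet: rewind past any partial read
                # so a half-written line is re-read whole, then wait
                f.seek(where)
                time.sleep(poll_interval)

# Hypothetical usage: each yielded line is one CSV row to feed to pandas
# for line in follow('myfile.csv'):
#     row = line.split(',')
#     ...
```

Note the loop never terminates on its own; the consumer decides when to stop.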

Funsaized
    Did a quick search... writing your own tail function to iterate lines and follow new lines as they appear as done here could be another route https://lethain.com/tailing-in-python/ ... just write the output data to the dataframe in whatever method you choose... – Funsaized Jan 25 '18 at 18:54
  • Thanks for your suggestion, I'm not sure this would work for my specific case as the new row would be written every x secs/mins so I can't use the chunking to avoid reaching the end of the file before a new line is written... – Mark Smith Jan 25 '18 at 18:55
    The python logging functionality has some promise. It was designed exactly for this case (capturing real time data)... https://docs.python.org/2/library/logging.html . It handles the buffering and file management for you. Pandas is designed to read large data files efficiently. Using both tegether they can solve your data management problem. Additionally, this gives you a copy of data to reference later. – Funsaized Jan 25 '18 at 18:57
  • Once again thank you for your suggestion - I will look into this and come back to you! – Mark Smith Jan 25 '18 at 18:59
    Lastly check out HDF5/pytables... see this answer https://stackoverflow.com/a/34282362/3794944 – Funsaized Jan 25 '18 at 19:00
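The logging idea from the comments could look roughly like this: the producer process logs each data point as a plain CSV line, and the pandas process reads the file back whenever it wants. The handler setup, file name, and field layout here are assumptions for illustration:

```python
import logging
import pandas as pd

# Writer side: append each data point as a bare CSV line (assumed layout)
logger = logging.getLogger("datafeed")
handler = logging.FileHandler("feed.csv")
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("2018-01-25,42.0")
logger.info("2018-01-25,43.5")
handler.flush()

# Reader side: pandas loads whatever has been written so far
df = pd.read_csv("feed.csv", names=["date", "value"])
print(df)
```

This also leaves a persistent copy of the data on disk, as the comment notes.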