At my work we have to use .sav (SPSS) files, for standardization purposes. I'm curious whether I can read SPSS/.sav files into pandas as CSV and essentially bypass reading them in as .sav.

For example, when I read in a file and then convert it to a CSV, I typically do this:

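import pandas as pd

df = pd.read_spss('filepath.sav')        # slow step: parses the SPSS file
df.to_csv('filepath.csv', index=False)   # index=False avoids an extra unnamed index column on re-read
df = pd.read_csv('filepath.csv')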

This is extremely inefficient and slow, because reading in .sav files is a time-consuming process.

So what I'm wondering is: can I read .sav files as .csv files without first reading them in as .sav?

Nate
  • There is currently an open issue about the performance of reading SPSS files: https://github.com/Roche/pyreadstat/issues/80. Please provide a sample file to investigate. Otherwise I don't think what you are asking for is possible: SPSS and CSV files are very different, so programs that work for CSV won't help you. What you can do, though, is save the SPSS files as CSV copies and use the CSVs for your work. – Otto Fajardo Oct 31 '20 at 11:40
  • pandas read_spss uses pyreadstat under the hood. Version 1.0.3 of pyreadstat has improved performance, so you can give pandas.read_spss another try. In addition, pyreadstat now has a new function, read_file_multiprocessing, that reads files in parallel processes, making things even better. To use the latter you need to call pyreadstat directly, as pandas does not expose that functionality. – Otto Fajardo Nov 06 '20 at 17:07
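
A minimal sketch of the parallel read mentioned in the comment above, assuming pyreadstat 1.0.3 or later is installed. The wrapper name `read_sav_parallel` and the default of 4 processes are illustrative choices, not part of pyreadstat.

```python
def read_sav_parallel(path, num_processes=4):
    """Read a .sav file with several worker processes.

    `read_sav_parallel` is a hypothetical helper name; the calls inside it
    are pyreadstat's own read_file_multiprocessing and read_sav.
    """
    import pyreadstat  # imported lazily so the helper can be defined without it

    # read_file_multiprocessing splits the rows into chunks, reads each chunk
    # with the given reader in its own process, and concatenates the results.
    df, meta = pyreadstat.read_file_multiprocessing(
        pyreadstat.read_sav, path, num_processes=num_processes
    )
    return df, meta


if __name__ == "__main__":
    # The __main__ guard matters on platforms that spawn worker processes.
    df, meta = read_sav_parallel("filepath.sav")
    print(df.shape)
```

The result is an ordinary pandas DataFrame plus a metadata object, so the rest of a pandas workflow should not need to change. Whether this is actually faster depends on the file size and the number of cores available.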

2 Answers

0

You might be interested in this topic. In short, it points to a wrapper around the C library ReadStat that reads SPSS files much faster than pandas.

The link to their GitHub repo is https://github.com/Roche/pyreadstat

Jorge Abreu
  • I use pyreadstat, but I'm trying to read .sav files as .csv so I can use plugins like modin/ray/dask. – Nate Oct 28 '20 at 18:22
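
Since a .sav file cannot be parsed as CSV directly, one workaround is to pay for the slow SPSS read only once and reuse a CSV copy afterwards. The sketch below assumes the CSV copy sits next to the .sav file; the helper names `needs_refresh` and `load_via_csv_cache` are hypothetical.

```python
from pathlib import Path


def needs_refresh(sav_path, csv_path):
    """Return True when the CSV copy is missing or older than the .sav file."""
    sav, csv = Path(sav_path), Path(csv_path)
    return not csv.exists() or csv.stat().st_mtime < sav.stat().st_mtime


def load_via_csv_cache(sav_path, csv_path=None):
    """Load the data, re-parsing the .sav file only when the CSV copy is stale."""
    import pandas as pd  # lazy import; pd.read_spss also requires pyreadstat

    # Default cache location (an assumption): same folder, .csv extension.
    csv_path = Path(csv_path) if csv_path else Path(sav_path).with_suffix(".csv")
    if needs_refresh(sav_path, csv_path):
        df = pd.read_spss(str(sav_path))   # slow path, taken once per change
        df.to_csv(csv_path, index=False)
        return df
    return pd.read_csv(csv_path)           # fast path on later runs
```

Once the CSV copy exists, CSV-only tools can read it directly. Note that a CSV round trip drops SPSS metadata such as value labels and categorical dtypes, so the two paths may return slightly different column types.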
0

Doesn't pd.read_spss return a DataFrame just like pd.read_csv?

  • It does, but I'm trying to use a plugin like dask/modin/ray in order to speed up processing, and those plugins don't support .sav files. – Nate Oct 28 '20 at 18:21
  • Does this plugin use the pickle read csv function, which you could edit? Or does it use a DataFrame, which you could supply from SPSS? I ask the second because you are storing the result in a DataFrame, I believe, which you would then probably supply to the plugin. Or is it something else entirely? – John Janzen Oct 28 '20 at 18:57
  • I don't think it does. Really, I'm open to trying anything to read .sav files in faster. Would/can I pickle read .sav files? – Nate Oct 28 '20 at 19:05
  • What are you really doing? Are you calling a function with a DataFrame argument? I don't know what you are trying to do differently. pd.read_spss(PATH) gives you the same data as a .csv file containing the same data would. It just looks different to you when looking at the file, I suppose. – John Janzen Oct 28 '20 at 19:32
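
To illustrate the point about supplying a DataFrame to the plugin: a frame that is already in memory can be handed to dask without going through CSV. This is a rough sketch assuming dask is installed; `to_dask` is a hypothetical helper name and the partition count is arbitrary.

```python
def to_dask(df, npartitions=4):
    """Wrap an in-memory pandas DataFrame as a dask DataFrame."""
    import dask.dataframe as dd  # lazy import; requires dask to be installed

    # from_pandas splits the existing frame into partitions; it does not
    # re-read any file, so the source format (.sav or .csv) no longer matters.
    return dd.from_pandas(df, npartitions=npartitions)
```

This does not make the initial .sav read any faster; it only lets later processing steps run through dask once the data has been loaded.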