2

I'm looking to automate the process of converting many .CSV files into .DTA files via Python. .DTA is the file type used by the Stata statistics package.

However, I have not been able to find a way to do this.

The R language has write.dta, which writes an R data frame to a .dta file, and Python can call R via RPy, but I can't figure out how to use RPy to access R's write.dta function.

Any ideas?

Parseltongue

2 Answers

4

You need rpy2 for Python and also the foreign package installed in R. To install foreign, start R and type install.packages("foreign"). You can then quit R and go back to Python.

Then this:

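import rpy2.robjects as robjects

robjects.r("library(foreign)")            # load the R package that provides write.dta
robjects.r('x <- read.csv("test.csv")')   # read the CSV into an R data frame named x
robjects.r('write.dta(x, "test.dta")')    # write x out in Stata's .dta format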

You can construct the string passed to robjects.r from Python variables if you want, something like:

robjects.r('x=read.csv("%s")' % fileName)
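
Since the question is about converting many files, the string-building idea above can be wrapped in a loop. The following is only a sketch: the helper names (r_string, r_commands, convert_folder_with_r) are made up for illustration, and it assumes each .dta file should be written next to its .csv. The escaping helper is there because a bare "%s" substitution breaks when a path contains a backslash or a double quote.

```python
from pathlib import Path


def r_string(value):
    """Quote a Python string as an R double-quoted string literal."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def r_commands(csv_path):
    """Return the R commands that convert one CSV file to a .dta beside it."""
    csv_path = Path(csv_path)
    dta_path = csv_path.with_suffix(".dta")
    # Forward slashes are accepted by R on every platform.
    return [
        "x <- read.csv(%s)" % r_string(csv_path.as_posix()),
        "write.dta(x, %s)" % r_string(dta_path.as_posix()),
    ]


def convert_folder_with_r(folder):
    """Convert every .csv in `folder`; needs rpy2 and R's foreign package."""
    # Imported here so the helpers above can be used without rpy2 installed.
    import rpy2.robjects as robjects

    robjects.r("library(foreign)")
    for csv_path in sorted(Path(folder).glob("*.csv")):
        for command in r_commands(csv_path):
            robjects.r(command)
```

Calling convert_folder_with_r("data") would then process data/*.csv one file at a time; printing r_commands(...) first is an easy way to check what will be sent to R.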
Spacedman
2

(copypasting from my answer to a previous question)

pandas DataFrame objects now have a "to_stata" method, so you can do, for instance:

import pandas as pd
df = pd.read_stata('my_data_in.dta')
df.to_stata('my_data_out.dta')

DISCLAIMER: the first step is quite slow (in my test, about 1 minute to read a 51 MB .dta file; also see this question), and the second produces a file that can be much larger than the original (in my test, the size went from 51 MB to 111 MB). Spacedman's answer may look less elegant, but it is probably more efficient.
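
The snippet above goes from .dta to .dta. For the CSV case in the question, the same to_stata method can be combined with pd.read_csv. This is a minimal sketch, not a tested recipe for every dataset: csv_to_dta and convert_folder are hypothetical names, and it assumes the .dta files should sit next to the .csv files and that the CSV column names are acceptable as Stata variable names.

```python
from pathlib import Path

import pandas as pd


def csv_to_dta(csv_path):
    """Read one CSV file and write it as a Stata .dta file beside it."""
    csv_path = Path(csv_path)
    dta_path = csv_path.with_suffix(".dta")
    df = pd.read_csv(csv_path)
    # write_index=False keeps the DataFrame's row index out of the output file.
    df.to_stata(dta_path, write_index=False)
    return dta_path


def convert_folder(folder):
    """Convert every .csv file in `folder` and return the new .dta paths."""
    return [csv_to_dta(p) for p in sorted(Path(folder).glob("*.csv"))]
```

to_stata also accepts a version argument for choosing the .dta format version, which may matter if the files have to be opened in an older Stata.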

Pietro Battiston
  • Warning to those unfamiliar with Stata: be aware that the .dta format is not constant but depends on the version of Stata. Stata X can read .dta files written by version X or lower, but it cannot necessarily read .dta files from higher versions. The format has changed about every 2 versions on average, so about once every 4 years. There is documentation. It's my impression that R is responsive to these changes, so going through R would usually be a good solution. I can't comment on pandas. – Nick Cox Apr 15 '14 at 09:31
  • @NickCox true. I can only say that pandas was able to open a version later than X (don't know which one, but my STATA X was not able to open it), and then the exported dta could be opened with STATA X. – Pietro Battiston Apr 15 '14 at 15:20
  • Sounds good for you, except if the conversion process is downgrading the data and creating inconsistencies between you and other people using the "same" data. Unlikely, but watch out. As in my comment, correct program name is Stata. – Nick Cox Apr 15 '14 at 15:31
  • Yep, Stata, sorry. In my case, I verified all my results were reproducible as with the original. That said, the source code does warn for a couple of "NOT IMPLEMENTED" (minor, as far as I can judge) features: https://github.com/pydata/pandas/blob/master/pandas/io/stata.py – Pietro Battiston Apr 16 '14 at 21:40