Convert .CSV files to .DTA files in Python

Question

I'm looking to automate the process of converting many .CSV files into .DTA files via Python. .DTA is the file type used by the Stata statistics package.

I have not been able to find a way to go about doing this, however.

The R language has write.dta, which allows a data frame in R to be written to a .dta file, and there is a bridge from Python to R via RPy, but I can't figure out how to use RPy to access the write.dta function in R.

Any ideas?

Get a specification of the DTA file format and convert the CSV accordingly? – Tymoteusz Paul, Oct 10 '13 at 12:50
I don't see why it matters here that it is a binary file; Python can work with binary data just fine. – Tymoteusz Paul, Oct 10 '13 at 13:13
@Parseltongue: have you thoroughly read the RPy docs? P.S. Basically, does the question boil down to *"How to write DTA files in R?"*? – Erik Kaplun, Oct 10 '13 at 13:19
http://stackoverflow.com/questions/7503487/save-dta-files-in-python might be useful. Have you tried it? – Spacedman, Oct 10 '13 at 13:26

score 4 · Accepted Answer · answered Oct 10 '13 at 13:25

You need rpy2 for Python and also the foreign package installed in R. You do that by starting R and typing install.packages("foreign"). You can then quit R and go back to Python.

Then this:

import rpy2.robjects as robjects
robjects.r("require(foreign)")
robjects.r('x=read.csv("test.csv")')
robjects.r('write.dta(x,"test.dta")')

You can construct the string passed to robjects.r from Python variables if you want, something like:

robjects.r('x=read.csv("%s")' % fileName)
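Since the question asks about many files, the same idea can be wrapped in a loop. The following is a minimal sketch, not part of the original answer: the helper names `r_string`, `build_r_commands` and `convert_folder` are hypothetical, and it assumes rpy2 and R's foreign package are installed as described above. It escapes backslashes and double quotes before placing a path inside the R command, which plain `%s` interpolation does not do.

```python
from pathlib import Path


def r_string(text):
    """Quote text as an R string literal, escaping backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_r_commands(csv_path):
    """Return the R commands that convert one CSV file to a .dta file next to it."""
    csv_path = Path(csv_path)
    dta_path = csv_path.with_suffix(".dta")
    # as_posix() gives forward slashes, which R accepts on Windows as well
    return [
        "x <- read.csv(%s)" % r_string(csv_path.as_posix()),
        "write.dta(x, %s)" % r_string(dta_path.as_posix()),
    ]


def convert_folder(folder):
    """Convert every *.csv file in folder; needs rpy2 and R's foreign package."""
    # Imported here so the two helpers above can be used without rpy2 installed
    import rpy2.robjects as robjects

    robjects.r("library(foreign)")
    for csv_path in sorted(Path(folder).glob("*.csv")):
        for command in build_r_commands(csv_path):
            robjects.r(command)
```

Calling `convert_folder("data")` would then write a .dta file beside each CSV in a (hypothetical) `data` directory. Note that the `*.csv` pattern is case-sensitive on some platforms, so files named `.CSV` may need their own pattern.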

score 2 · Answer 2 · edited May 23 '17 at 12:06

(copypasting from my answer to a previous question)

pandas DataFrame objects now have a "to_stata" method, so you can do, for instance:

import pandas as pd
df = pd.read_stata('my_data_in.dta')
df.to_stata('my_data_out.dta')

DISCLAIMER: the first step is quite slow (in my test, around 1 minute to read a 51 MB .dta file - also see this question), and the second produces a file that can be much larger than the original (in my test, the size went from 51 MB to 111 MB). Spacedman's answer may look less elegant, but it is probably more efficient.
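For the CSV case asked about in the question, the read step would presumably use `pd.read_csv` instead of `pd.read_stata`. The following is a minimal sketch under that assumption; the function names `csv_to_dta` and `convert_folder` are hypothetical and not from the original answer.

```python
from pathlib import Path

import pandas as pd


def csv_to_dta(csv_path, dta_path=None):
    """Read one CSV file and write it as a Stata .dta file; return the output path."""
    csv_path = Path(csv_path)
    if dta_path is None:
        # Default: same folder and base name, with a .dta extension
        dta_path = csv_path.with_suffix(".dta")
    df = pd.read_csv(csv_path)
    # write_index=False keeps the pandas row index out of the Stata file
    df.to_stata(dta_path, write_index=False)
    return Path(dta_path)


def convert_folder(folder):
    """Convert every *.csv file in folder and return the list of .dta paths."""
    return [csv_to_dta(path) for path in sorted(Path(folder).glob("*.csv"))]
```

Recent pandas releases also accept a `version` argument to `to_stata` for choosing the .dta format version, which may help when the files must open in an older Stata. Column names that are not valid Stata variable names may be renamed on write, so it is worth checking the output of the first few conversions.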

answered Apr 15 '14 at 09:00

Pietro Battiston

Warning to those unfamiliar with Stata: Be aware that the .dta format is not a constant, but dependent on version of Stata. Stata X can read .dta files for version X or lower, but it cannot necessarily read .dta files for higher versions. The format has changed about every 2 versions on average, so about once per 4 years. There is documentation. It's my impression that R is responsive to these changes, so going through R would usually be a good solution. I can't comment on Pandas. – Nick Cox Apr 15 '14 at 09:31
@NickCox true. I can only say that pandas was able to open a version later than X (don't know which one, but my STATA X was not able to open it), and then the exported dta could be opened with STATA X. – Pietro Battiston Apr 15 '14 at 15:20
Sounds good for you, except if the conversion process is downgrading the data and creating inconsistencies between you and other people using the "same" data. Unlikely, but watch out. As in my comment, correct program name is Stata. – Nick Cox Apr 15 '14 at 15:31
Yep, Stata, sorry. In my case, I verified all my results were reproducible as with the original. That said, the source code does warn for a couple of "NOT IMPLEMENTED" (minor, as far as I can judge) features: https://github.com/pydata/pandas/blob/master/pandas/io/stata.py – Pietro Battiston Apr 16 '14 at 21:40
