
I know it's possible to read in either Stata categorical labels or the underlying values using the convert_categoricals parameter of pd.read_stata.

I was looking for a way to write/export a pandas DataFrame to Stata and include the value labels. However, all I could find was either

data_label : str, optional for the dataset label

or

variable_labels : dict for the column (variable) labels,

but nothing for the values themselves.

Andrei
  • Hi! I think this answers your question: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_stata.html – Serge de Gosson de Varennes Dec 12 '20 at 10:39
  • Thanks. I should have stated that I'd looked at the docs and not found an answer. – Andrei Dec 12 '20 at 11:01
  • I'm surprised, but the docs indeed don't indicate an option to do this. It looks like it is not possible at the moment. – Wouter Dec 13 '20 at 09:49
  • @SergedeGossondeVarennes, @Wouter I added a workaround to @Andrei's question that works within Stata using the `Stata Function Interface (sfi)`. I don't know if this solves your problem, because it assumes you have Stata 16 running on your machine, but unfortunately I couldn't find a way to export value labels using `DataFrame.to_stata()`. – Álvaro A. Gutiérrez-Vargas Dec 13 '20 at 12:05

3 Answers


Here is an answer to your question. It is probably not what you were expecting, because I am not using DataFrame.to_stata but the Python integration introduced in Stata 16.

The code below must be executed within Stata (version 16 or later). Briefly, I create a pandas DataFrame (df) and copy it into Stata. The trick is to attach labels to the values using ValueLabel.setLabelValue() from the sfi module.

clear all

python:
from sfi import ValueLabel, Data
import pandas as pd

data = [['Eren Jaeger', 15, 1, 'Soldier' ] , ['Mikasa Ackerman', 14, 1, 'Soldier'], ['Armin Arlert', 14, 1 , 'Soldier'],['Levi Ackerman', 30, 2, 'Captain']]  
#creating DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age', 'Rank_num', 'Rank'])

##              Name  Age  Rank_num     Rank
##0      Eren Jaeger   15         1  Soldier
##1  Mikasa Ackerman   14         1  Soldier
##2     Armin Arlert   14         1  Soldier
##3    Levi Ackerman   30         2  Captain


# Set number of observations in Stata
Data.setObsTotal(len(df))

#Create variables on Stata (from Python)
Data.addVarStr("Name",10)
Data.addVarDouble("Age")
Data.addVarInt("Rank_num")

#Store the content of "df" object from Python to Stata
Data.store("Name", None, df['Name'], None)
Data.store("Age", None, df['Age'], None)
Data.store("Rank_num", None, df['Rank_num'], None)

# HERE is where I solve your question!
# 1) Create the labels
ValueLabel.setLabelValue('rank_num_LABEL', 1, 'Soldier')
ValueLabel.setLabelValue('rank_num_LABEL', 2, 'Captain')
ValueLabel.getValueLabels('rank_num_LABEL')

# 2) Attach the labels to the created variable
#Attach the created label 
ValueLabel.setVarValueLabel('Rank_num', 'rank_num_LABEL')

end 

br

* At the end, you will see the following on the Stata browser
* Name              Age Rank_num
* Eren Jaeger       15  Soldier
* Mikasa Ackerman   14  Soldier
* Armin Arlert      14  Soldier
* Levi Ackerman     30  Captain

If you want to better understand the reasoning behind the code above, here are the references I learned it from.

  1. Stata/Python integration part 9: Using the Stata Function Interface to copy data from Python to Stata
  2. Stata/Python integration part 8: Using the Stata Function Interface to copy data from Stata to Python

The pandas equivalent of a Stata variable with numerically encoded string values is the Categorical dtype. Exporting a Categorical column with the to_stata method writes it as a value-labelled Stata variable. Taking the example of Álvaro A. Gutiérrez Vargas:

import pandas as pd

data = [['Eren Jaeger', 15, 1, 'Soldier'], ['Mikasa Ackerman', 14, 1, 'Soldier'], ['Armin Arlert', 14, 1, 'Soldier'], ['Levi Ackerman', 30, 2, 'Captain']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Rank_num', 'Rank'])
# Convert the string column to a Categorical so that to_stata writes value labels
df['Rank'] = df['Rank'].astype('category')
df.to_stata('YOUR/PATH/HERE', write_index=False)

This will create a Stata dataset with a Rank variable encoded as 0=Captain, 1=Soldier. One could change the order by using Categorical.reorder_categories() or Categorical.set_categories(), for example:

df['Rank'] = df['Rank'].cat.reorder_categories(['Soldier', 'Captain'], ordered=True)

Now, exporting with the to_stata method will use encoding 0=Soldier, 1=Captain.

The Categorical approach gives you no way to specify a custom encoding though, so if you need something more specific than a 0 to max encoding, you should go with the method of Álvaro A. Gutiérrez Vargas.
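To check which codes actually end up in the file, one option is to write the data to a temporary file and read it back twice, once with and once without convert_categoricals. This is only a minimal sketch: the temporary path and the cut-down DataFrame (Age and Rank only) are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Cut-down version of the example data (illustrative assumption)
df = pd.DataFrame({
    "Age": [15, 14, 14, 30],
    "Rank": ["Soldier", "Soldier", "Soldier", "Captain"],
})
df["Rank"] = (
    df["Rank"]
    .astype("category")
    .cat.reorder_categories(["Soldier", "Captain"], ordered=True)
)

# Temporary location, used only for this round trip
path = os.path.join(tempfile.mkdtemp(), "ranks.dta")
df.to_stata(path, write_index=False)

# Raw numeric codes as stored in the .dta file
codes = pd.read_stata(path, convert_categoricals=False)["Rank"].tolist()
# The same column with the value labels applied (the default behaviour)
labels = pd.read_stata(path)["Rank"].astype(str).tolist()

print(codes)   # [0, 0, 0, 1]
print(labels)  # ['Soldier', 'Soldier', 'Soldier', 'Captain']
```

The codes follow the category order, which is why reordering the categories changes the encoding.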

Wouter
  • Hi @Wouter, this is an interesting point because it clarifies the lack of a direct equivalent to the `label values` that Stata has. Additionally, it could be a _workaround to my workaround_, because you could use the encoding you proposed to label your values with post-processing routines within Stata on versions older than 16 (I am mostly thinking of string manipulation together with the `label values` command). – Álvaro A. Gutiérrez-Vargas Dec 14 '20 at 15:05
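One way to read the post-processing idea in this comment is to generate the Stata commands from Python and run them in Stata after importing the plain numeric column. The sketch below is an assumption about what that could look like: stata_label_commands is a hypothetical helper, not part of pandas or Stata, and it does not escape double quotes inside labels.

```python
def stata_label_commands(varname, label_name, mapping):
    """Hypothetical helper: build `label define` / `label values` commands."""
    # One `value "label"` pair per entry, in ascending order of the value
    pairs = " ".join(f'{value} "{label}"' for value, label in sorted(mapping.items()))
    return [
        f"label define {label_name} {pairs}",
        f"label values {varname} {label_name}",
    ]


commands = stata_label_commands(
    "Rank_num", "rank_num_LABEL", {1: "Soldier", 2: "Captain"}
)
print("\n".join(commands))
```

The printed lines could be pasted into a do-file, which also works on Stata versions older than 16.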

As of April 2023, pandas lets you pass value_labels to pd.DataFrame.to_stata() (the parameter was added in pandas 1.4.0). If you look at the source of the to_stata method, you will find the descriptions of the variable labels, the data label and the value labels. Here is an excerpt from that description:

....

value_labels : dict of dicts

Dictionary containing columns as keys and dictionaries of column value to labels as values. Labels for a single variable must be 32,000 characters or smaller.

....

Example: if the column "animals" can take the two values [1, 2] and you want to label them ['Cat', 'Dog'], then pass the following to pd.DataFrame.to_stata():

value_labels = {'animals': {1: 'Cat', 2: 'Dog'}}
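A fuller round trip might look like the sketch below. The column "weight", the sample values and the temporary file path are assumptions made for illustration, and value_labels needs pandas 1.4 or later.

```python
import os
import tempfile

import pandas as pd

# Illustrative data: "animals" holds the numeric codes to be labelled
df = pd.DataFrame({"animals": [1, 2, 2, 1], "weight": [4.2, 11.0, 9.5, 3.8]})

# Temporary location, used only for this round trip
path = os.path.join(tempfile.mkdtemp(), "animals.dta")
df.to_stata(
    path,
    write_index=False,
    value_labels={"animals": {1: "Cat", 2: "Dog"}},
)

# Default read applies the value labels and returns a Categorical column
labelled = pd.read_stata(path)
# convert_categoricals=False keeps the custom numeric encoding
raw = pd.read_stata(path, convert_categoricals=False)

print(labelled["animals"].tolist())
print(raw["animals"].tolist())
```

Unlike the Categorical approach, the numeric codes here are whatever is already in the column, so a custom encoding such as 1/2 is preserved.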

Farkhad