4

I have a relatively large .dta file, with 1280000 observations, that works fine in Stata, but I am having troubles importing it into R. The data was created with Stata 15, the data contains strL or str#, #>244 variables and cannot be saved in the Stata 12 format.

I am trying to use the haven package to import the saved data using read_dta() but it is giving me the following error message: "Failed to parse /Users/folder/my_data.dta: Unable to allocate memory." Does anyone know what might be causing this problem and how to overcome it to be able to import the data in R?

I have attempted to overcome the issue in multiple ways, but none of my attempts appears to work.

  1. First I tried to expand the memory size of my r environment using Sys.setenv('R_MAX_VSIZE'=32000000000) but the console reports the same error when I try to import the data. The problem does not appear to relate to the size of my memory in R.

  2. I tried to save the data in Stata 13 format, using saveold my_data13, version(13) in Stata, but trying to import it into R with haven still produces the same error message.

  3. I tried to use readstata13 function read.dta13(my_data13), but this regularly end up in R crashing.

The strange thing is that I am able to open the data correctly in Stata, by simply double-clicking on it.

Does anyone has any suggestion on how to address this issue? Any insight on a) the meaning of the error message and how to address it 2) alternative packages able to ope stata15 files c) approach to be able to open the data in R would be most welcome.

Thanks a lot in advance for your help

Best Regards

Alex
  • 1,207
  • 9
  • 25
  • If any workaround fails, what about exporting to another file type like csv in Stata, then importing in R using the csv file? – Wouter Dec 15 '20 at 12:09
  • I agree with @Wouter that you could export the file into something like CSV, even though I can foresee some difficulties given the URL structure you are working with. Another approach would be to try to [import the .dta file it into Python](https://stackoverflow.com/questions/19744527/pandas-read-stata-with-large-dta-files?rq=1) and then from Python [export it to R](https://stackoverflow.com/questions/43046688/python-dataframe-write-to-r-data-format). – Álvaro A. Gutiérrez-Vargas Dec 15 '20 at 18:00
  • Dear Alvaro, thank you very much for your insight, unfortunately, as you predict, I have issues in exporting to a csv files, as the long strings, are stored incorrectly by the csv file and re-opening it creates further problems.. I will try the python option, but it is a pity there is no direct way to do this in R – Alex Dec 15 '20 at 18:11
  • So I was able to have access to the raw CSV file, I have imported it into R, and then I try to store it with `write_dta(aparm_f,"data.dta")` and I get the following error `Writing failure: A provided string value was longer than the available storage size of the specified column.` could the two issues be related? – Alex Dec 15 '20 at 18:25
  • @Alex, fortunately, I saw your comment (don't forget to tag people when commenting). In my opinion, and I am guessing here, the problems seem to be related. Now with this information, I am a bit confused about your final goal. Do you want to work with the data in Stata or R? – Álvaro A. Gutiérrez-Vargas Dec 15 '20 at 18:39
  • Dear @ÁlvaroA.GutiérrezVargas, thanks for your reply. My goal is to work with both, I prefer using R, while a colleague of mine prefer Stata. And there are certain things we prefer doing with one R and others with Stata. The bottom line is that I would like to be able to go from one program to the other quite easily. but we face the impossibility to move smoothly from one program to the other. – Alex Dec 15 '20 at 18:44
  • Dear @Alex, I see that the problem is harder to solve than it seems, in particular, because of the extension of the URL. Assuming that you have access to Stata version 15 or later (due to the length of the strings that can be saved), I would process whatever you wanted to do, using either R or Stata, and save the output into CSV. Later, you can continue from the saved CSV file and move into the next analysis in whatever software you like, without being restricted to `.dta` o `.rds` files. – Álvaro A. Gutiérrez-Vargas Dec 15 '20 at 18:57
  • 1
    Dear @ÁlvaroA.GutiérrezVargas, thanks a lot for your reply. Indeed I fear the problem relates to the fact that two of my character strings have up to 4200 characters. Using cvs is probably the best option if no-one is able to give insights on how to import such `.dta` files with long strings in R. – Alex Dec 15 '20 at 19:11
  • In case someone is interested, I was able to overcome the problem in stata. The issue were strL variables. I created a loop to convert all of the strL variables into str# variables, and then I was able to save the data in Stata and importing it regularly in R using `haven`. here is the loop for Stata: `foreach var of varlist var1 var2 { generate str 'var'_str = 'var' replace 'var' = "" compress 'var' replace 'var' = 'var'_str drop 'var'_str describe 'var' }` – Alex Dec 15 '20 at 22:41
  • @Alex, I wrapped up the whole discussion we had in an answer to your question, mainly because I was curious about this whole issue. If you have any questions or comments, I'd be more than happy to read you. – Álvaro A. Gutiérrez-Vargas Dec 16 '20 at 14:39

1 Answers1

2

I just want to wrap up everything mentioned in the comments.

From the official Stata documentation, we have the following:

strL variables can be 0 to 2-billion bytes long. strL variables are not required to be longer than 2,045 bytes. str# variables can store strings of up to 2,045 bytes, so strL and str# overlap. This overlap is comparable to the overlap of the numeric types int and float. Any number that can be stored as an int can be stored as a float. Similarly, any string that can be stored as a str#, can be stored as a strL. The reverse is not true. In addition, strL variables can hold binary strings, whereas str# variables can only hold text strings. Thus the analogy between str#/strL and int/float is exact. There will be occasions when you will want to use strL variables in preference to str# variables, just as there are occasions when you will want to use float variables in preference to int variables.

The above information makes me infer two points before I try to replicate your problem.

  1. In your case, the use of strL was not necessary.

  2. The use of strL was the source of your problem, with a probable result some compatibility problem with the haven library.

However, after trying to replicate what you described, I arrived at a different conclusion.


Please gently consider the following code that emulates your problem.

version 16
clear all 
set obs 1
*4200 character string generated here: http://www.unit-conversion.info/texttools/random-string-generator/
gen strL str_var_in_strL = "eSw0qZcVs5DHU2GxgRo1Seo9uTwJ0MvHXyYUQidJMRWw8KW1310Ec242O6D4xrLziO4c56WgluSddTy0Q64QapkwGgOMZdy8ru0fyss0nwJvF4M3kBjYGF00ZsvQGYt4DjF51R3vxTzUx4xlApKwaoRADIgFlXvBh2Bug0VVhmXR3uInHDfpmID57kVWiyxX1gELdyPMVzJWizEHVx2GpjBsm1UdRphDdukFtFrnkr1HFRXBekxHkW3uOCHz0wnyDBfwitDGHosctRrWPhIjujnoalOaHkI5jbnENSNJEsOdGohoe5QKZIxtXmVbD4l8m8wLCbuSjZLw8NzU5vjPX57T2yWWasdFMIHk3kFipT0CG3dNForECS8UiW6ZWSIEmO2V62uakfrxTsRb9fIFVBUIHpGizeR0b27OnfSVB2wE2Ix0ij7kR19jz0wIh35fbwkJWqLq93pfHEtGu0FTb8H5A4XNOcR8chEAQBI7zV3rosSGnSP2h9QZtuSAcz1TrRHNMpCguvNf1DD72TCfCaiBXyflOCre7f5zchLA7k2cQ5qi4fBMVc9GnAdGB2vnjFeFlwaUD0AEUhfSJJINRQ2CKfJegqUL0jBgHBVy5cYCNxsP8Gu8NXRUo6vvyiTJMDcBkL0JKNOT4usSDi4v86cJNzQQa3ArafRzOv1RFz8BfI7pP7rXDLD6d1Z1miCqTZ8UtJBVQ0Z0eCQmTrlAvlu5busOjcAl4ZV7THH6qCV8tI53zh1THBfjnEgoPxy8UIaIK6tXDUM4RFMMd1366324mJEVwyvc5CWgzPian39Q3GFLl6zXCfD4pw7rSUmH5CNOmKgPihxPbV9NSBxiwVK3M07KFS2hZbf3ZDB4CBSJV9geFWKZlR3XNrsPudQgkpsdywNNjZTDwD2RiHF7kQAgyEW7q1w42OC2IbreBBtiPekx6yzCEWBEokLwfhrhbOnDwcnFmfKjnrxCbqypXrSnyvrUP2nUQ9vBmdxCqiVLrBHuDi6Wv2U4vyZ7dTqk84WmnwACXo5PbYY2dmhtjscLMpRw4Q6xVUEWC3qPMnQkbI1UKEq1NfOrF0X8nC0rqrwHQuNuJqHuebJj5AMXVgyZWTaqYIb4gkbGaEze3wNmHbbj1q2bmumiwd6RZRSdx7U3ZwozO9kTkZ69NHFSa2QDi8GrhgvBDMshJVaOR9K8tWcpa2QrFD7cI0ZqzneLXHXm6LsOmtZPFmikKfyts1pASGwZ8DzuWfT3j0daNmyk6y0HwHwM98KOyeuSXnQJOJzunAXkidv90hrgviWUhP70Nrx527JI7vpRr2dClBBnzO2O7YwjTdKTrmfcPs9z4iLeroo20Jg9ODHjvUYWtLRTOKrgvYAgywkj2PoVdwmuYK32UKcqH5EdhPHxWarjsqUuBb40u6nUGIQ0YS7ZuzsnVDerB8hO3rCl0FlMMYgRh4vdPcEG23JQoIwTvdujULg6Lpplyt6yK16UhVSklj6aVNIoA4zr51dULyOzWF9ZqlZz7l90QpXvLuRD9Elr5gxWvNW4fvkCAU3kpEv0s7gHS7ytjNxm0WLk74bN1iP8ZjcxXXBqwtatCo9e1Ayc59VYR9RVxtfvilb038WpHglhWEZoK91rumPSFiCJWUmlkL6P4SAbz5b6LDdW9ybiN8zdZmNtQ2px556d7DF5RRcXgLocLH37Uh6uU9cz2wmWRrcJS4rO9MkUe6KSuVjVLXSsk6J1bnvvagWl4BkY8ZPm0iBg4XXTkRAjfVgnfex1hee47b6k9c5gdS6AJSVazCPpXQJlGJ7NpyAn3hXdHkhaGtokTmne6Zag8DterOyDldPXHXwrG7PgtsmREc0VugLVPrYEbdf9QMHBtGQLwQz04Gyg2lspZ5HbGnOkfI0MTanuMN7XnWdcGBko9gmQKbpONgPqg8POcpxG2aRefswG090hvYKj5gzp3r1nitZZhBm8KDUT8P2Wy06hPxrkZinMGmBv2SIDegXr5uzceHymEnyMQZINS96QCyTiV7z1X1NZ3IBDfVPZTZ9bRxpKyMbAnYzFhx9PYSkescyMMtsGOEli1gFp2PWcqO4bpj0EnKjgWf9ae2R5nDKIkVbsNRCik3JrCM7WjHPfwdZSiA335Eyl0yoHQWjp6YJrR8ykOtw3zL2XHa2ilKIRSypG5dtDwjuqLI1fb7fB5wiG3LuowKqam8HY86aDsuu0DkpED9mAxoSvE7V6WPs5ptg31yoUOgGK1rvGtdpY7CHkaBmmv0jKNYjcZiuET6Q0If2IO36HafXJN8onjMvYadAypEY4IAxkmU2yemCFQkdDBuhr8G4DWcGO46W24QOvMghH6k3HVHeDgj5dNXKIz3rqbCC6tSNMptuQhoM2eWnwBFw6PzNIqKwB6qxbJMs9wxqvDEJlQkKgMZN4HJu1hNFpXIePLN8dSsV8xhgb1pxwFaqhLQMoYXcgobOdcrb7DDpJFbWhUXKn3WHEjO8nk4EAuNmIUdyfWxwxPPmTLEyohT7QrcjvRph82n64aRJIcyDPCho8pqtCTve84PAp4jIeechI8sl92e94jsX7XTZu3LaqDGkEEtcmp69ZqPA5Ev9NZv5ovmWNiu39kKu7QW0YnXGCvorirdScdCow4NyLgnpoAEG0FPz52oN7xU5xyTgY0x5Hel5GDsA28Qoy463tyBNuT51gQROr9XqgiM9Voq0ax1vI8QKFMLXwaLtHwPC7TkFtJbXYNmPt2kXzQ1EjLq0DGiyKd0BFMww3zpEjeSUS7KhCrf5qU7aDjIkVniPs6TGkTBOwG9ItrUv50WJfqgOd6ngHfYWzJFIAgZnGjXhtkHartO3F19iPs5VHRhTEUw2HZbgnTjmf2NmJ1onUkMNSFUMPrNkrfxl78GHNjJrGAkRU0jlXqAuIK8v6uYh4oyqSrFtzPru8GNIWCbka6LKLrMcCysoDk7VQI12xzELxUebiUsnLYCrnvmJtD0T7yv9M8H8rI4YsdbzD3forc56uvwqS0h0Bl6Sw71n781EC0R0V6067RA2TRZ39fs7yXYZ4O5pQ1uHn0qV82aZI1kHxWVJ1omu81KqoFpTnB0QuNd62AeVKRiuMiAf2UFhy40vFgFElRZFipH1TrJAuFYcgwd38kJWTGTYyW51z1DQfVZnlIegEfZQPTjnhFroayXw55MKnJGiVPQ4R0A2nO5LCDmDFmC8SyLz62pn0aQb5tvlQs5Es7woSf3SbxcKh9JndW42V8hQn4uXxbEKhAX9f48VJg5xXYsMOaBz4h0UfsOrOucFH3YVA9c7TVszXbSq7kRQkpsy3xkunDaIfdiAx5wdLE7LbPhUwrC1FWnCa2qQQxNUimxrQ35Woar9tNQSwpVd8ybEaivgQ77HPSYjTdkKX2j2TBCzmGVBesOUnWI9r34kRO5xPsPPDoJvLQe6kns75Yjcuz82OEuUai1PLVRKmzkRjyLp3tt5YDkjzuYCOWNchY3Eup1IEDvGf64wu4S1qLvRrl6HI9jZj7Li2GZc9grCTxbqpUQgCbCxdgmS6a396AJNijmG8uNnchGPlnNVm6DskG7T2pWasVuuhYhkyFNoUWuY5mBXurDMEDyyZPxlY9nlQYKHBNgg6ZnNEYnwCqTLzudDBQ48YG9r3700uvz83jJAX18s2Kjm2LlmOuPJON6rzbPua8Ac2Y0HPuQZD9Ikcim2MOyR9mbvtRTPeLAX3issevCDYaBiG6BFMaN9rW7j1UnlKQZYgTCveE4oH8tT7QGwdWENAjW4kGjS93zCS6QYxyUjg03er43KivMQOVHaT3iznZnQD3Nk5c0T9IKqRpcytY7JaRV7kmayUmKc4d1ApFqY8imlu2iTMiVfY16qMqDeulTtcKjKUuyWBrJSENwv238nWXQudShLeCsiwUMUnJvXyHdSsmAaaoG5O3RA4GkiQVAiX63tWPl6GNfweFAcpxoD4x8hZpbQ1SaBo3pRNwwHuAvzOwm0jKWndfugKqlUmPoDZb9Bx6dzDolUJtHSNYVrOACFY26SyeyeiFHnnd6wZFypkDGL4LgeBbU8TqJTjO2lFXVmCQZjTnO75V43vJKoHUrRSUkZJYTyl3c8tqWJmrUZ7lJo4cUrGzoudgzDHH8N8H73TwXboF8rbQ8i4yb9w51L0EnCd3kdmSXI0PJxuQ9CQ8AD9WKTwvEaSWDTZ"

describe

Contains data
  obs:             1                          
 vars:             1                          
------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
str_var_in_strL strL    %9s                   
------------------------------------------------------------------------------------

save "route\all_in_strL" , replace

Afterward, I applied your modification to the data I generated.

gen var1 = str_var_in_strL

foreach var of varlist var1 {
     generate str `var'_str = `var' 
     replace `var' = "" 
     compress `var' 
     replace `var' = `var'_str  
     drop `var'_str  
     describe `var' 
}

  obs:             1                          
 vars:             2                          16 Dec 2020 12:07
------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
str_var_in_strL strL    %9s                   
var1            strL    %9s                   
------------------------------------------------------------------------------------

save "route\edited_strL" , replace 

Weirdly enough, I was able to import both files into R using the haven library with the following code.

library(haven)
file <- "route_to_first_file/all_in_strL.dta"
# Import first file into R.

dta_in_R <- read_dta(
                    file,
                    encoding = NULL,
                    col_select = NULL,
                    skip = 0,
                    n_max = Inf,
                    .name_repair = "unique")



# Import edited file using your loop method into R.
file <- "route_to_edited_file/edited_strL.dta"

edited_dta_in_R<- read_dta(
                          file,
                          encoding = NULL,
                          col_select = NULL,
                          skip = 0,
                          n_max = Inf,
                          .name_repair = "unique")


The only differences here could be:

  1. The size of the dataset (just one observation in my case).
  2. The possibility that you may be using an earlier version of Stata (I am using Stata 16).

Finally, I think the source of the problem wasn't the strL type of the data, but memory available on your machine, which was probably solved by your compress step in the for loop you described.


PS: Everything was run on Win10. R version 4.0.3 (2020-10-10) and haven_2.3.1

Nick Cox
  • 35,529
  • 6
  • 31
  • 47