
Calling the read.table() function (on a CSV file) as follows:

  download.file(url, destfile = file, mode = "w")
  conn <- gzcon(bzfile(file, open = "r"))
  try(fileData <- read.table(conn, sep = ",", row.names = NULL), silent = FALSE)

produces the following error:

Error in pushBack(c(lines, lines), file) : 
  can only push back on text-mode connections

I tried to “wrap” the connection explicitly with tConn <- textConnection(readLines(conn)) [and then, of course, passing tConn instead of conn to read.table()], but that made execution extremely slow and eventually caused the R processes to hang (I had to restart R).
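For reference, the wrapping attempt looked roughly like this (a sketch of what is described above; file is the archive downloaded in the snippet at the top):

  conn <- gzcon(bzfile(file, open = "r"))
  lines <- readLines(conn)                   # read all decompressed lines into memory
  close(conn)
  tConn <- textConnection(lines)             # re-expose them as a text-mode connection
  try(fileData <- read.table(tConn, sep = ",", row.names = NULL), silent = FALSE)
  close(tConn)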

UPDATE (this shows again how useful it is to explain your problems to other people!):

As I was writing this, I decided to go back to the documentation and reread the entry on gzcon(), which I thought not only decompresses the bzip2 file, but also “labels” it as text. But then I realized that this is an unwarranted assumption: I know that the archive contains a text (CSV) file, but R doesn't. Therefore, my initial attempt to use textConnection() was the right approach, but something else is causing the problem. If - and it's a big IF - my logic is correct up to this point, the next question is whether the problem lies in textConnection() or in readLines().

Please advise. Thank you!

P.S. The CSV files that I'm trying to read are in an "almost" CSV format, so I can't use standard R functions for CSV processing.

===

UPDATE 1 (Program Output):

===

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectAuthors2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 514960 bytes (502 Kb)
opened URL
==================================================
downloaded 502 Kb

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectDependencies2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 133295 bytes (130 Kb)
opened URL
==================================================
downloaded 130 Kb

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectDescriptions2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 5404286 bytes (5.2 Mb)
opened URL
==================================================
downloaded 5.2 Mb

===

UPDATE 2 (Program output):

===

After a very long time, I get the following message, and then the program continues processing the rest of the files:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 8 elements

Then the situation repeats itself: after processing several smaller (less than 1 MB) files, the program "freezes" while processing a larger (> 1 MB) file:

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectTags2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 1226391 bytes (1.2 Mb)
opened URL
==================================================
downloaded 1.2 Mb

===

UPDATE 3 (Program output):

===

After giving the program more time to run, I discovered the following:

*) My assumption that a file size of ~1 MB plays a role in the weird behavior was wrong. The program successfully processed files larger than 1 MB and failed on files smaller than 1 MB. This is an example of output with errors:

trying URL 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectInfo2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 826288 bytes (806 Kb)
opened URL
==================================================
downloaded 806 Kb

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 4 elements
In addition: Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

Example of errors while processing a very small file:

trying URL 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectRequirements2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 3092 bytes
opened URL
==================================================
downloaded 3092 bytes

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 2 elements

From the above examples, it is clear that size is not the factor, but file structure might be.

*) I wrongly reported the maximum file size earlier; it is actually 54.2 MB compressed. This is the file whose processing does not merely generate error messages and continue, but actually triggers an unrecoverable error and stops (exits):

trying URL 'http://flossdata.syr.edu/data/gc/2012/2012-Nov/gcProjectInfo2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 56793796 bytes (54.2 Mb)
opened URL
=================================================
downloaded 54.2 Mb

Error in textConnection(readLines(conn)) : 
  cannot allocate memory for text connection

*) After the emergency exit, five R processes use 51% of memory each, whereas after a manual R restart this number is only 7% (per the htop report).

Even allowing for a "very bad" text/CSV format (suggested by the "Error in scan()" messages), the behavior of the standard R functions textConnection() and/or readLines() looks very strange to me, even "suspicious". My understanding is that a well-behaved function should handle erroneous input data gracefully, spending very limited time/retries before continuing, if possible, or exiting when further processing is impossible. In this case, we can see (via the screenshot in the defect ticket) that the R process is taxing both the memory and the processor of the virtual machine.
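For illustration, the kind of graceful behavior I have in mind would be something like wrapping each file's processing in tryCatch(), so that a malformed file is reported and skipped instead of taking down the whole run (just a sketch; the urls vector and the tab-separated format are assumptions, not my actual code):

  readOne <- function(url) {
    file <- tempfile(fileext = ".bz2")
    download.file(url, destfile = file, mode = "wb")
    conn <- bzfile(file, open = "r")
    on.exit(close(conn))
    read.table(conn, header = TRUE, sep = "\t", fill = TRUE)
  }

  results <- lapply(urls, function(u) {
    tryCatch(readOne(u), error = function(e) {
      message("Skipping ", u, ": ", conditionMessage(e))
      NULL                                   # continue with the remaining files
    })
  })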

Aleksandr Blekh
  • How big is the file? If there is no error and it's only slow you might be seeing normal behavior. This [faq](http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r) could be relevant. – Roland Feb 16 '14 at 10:45
  • @Roland: Thanks for the reference. I doubt that's normal, especially when slowness transforms into hanging. The biggest bzip2 file that I need to process is 5.9MB in size. Since it contains a text file, the size of the corresponding uncompressed file is about 6.2MB. The processing takes place on the smallest AWS instance (t1.micro). According to Amazon, memory size for t1.micro is 0.615GB. – Aleksandr Blekh Feb 16 '14 at 10:57
  • Forgot to mention: the code processes the first two files fine (meaning, relatively fast), but then hangs on the third. I'll update my question with the program output. – Aleksandr Blekh Feb 16 '14 at 11:25
  • Clarification: when I said "very long time", I meant more than 30 minutes. I have created a software defect description, corresponding to this issue, mostly in order to attach the screenshot of the `htop` output on my server (AWS instance): https://github.com/abnova/diss-floss/issues/1 – Aleksandr Blekh Feb 16 '14 at 11:56
  • Decided to re-read the documentation on `gzcon()` and I think I might have discovered the reason for all that "strange" behavior: "Compressed output will contain embedded NUL bytes, and so con is not permitted to be a `textConnection` opened with open = "w". Use a writable `rawConnection` to compress data into a variable." Of course, I might as well be wrong and it's not the source of the issues. Any comments? – Aleksandr Blekh Feb 16 '14 at 15:36

2 Answers


When this has happened to me in the past, I got better performance by not using textConnection(). Instead, if I have to do some preprocessing with readLines(), I then write the data to a temporary file and use that file as input to read.table().
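A rough sketch of that workflow, assuming file is the downloaded .bz2 archive from the question (the cleaning step is just a placeholder for whatever preprocessing is needed):

lines <- readLines(bzfile(file))          # decompress and read the raw lines
lines <- sub("\\s+$", "", lines)          # placeholder preprocessing step
tmp <- tempfile()
writeLines(lines, tmp)                    # write the cleaned lines to a temporary file
DF <- read.table(tmp, header = TRUE, sep = "\t")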

  • Thank you for your reply! I don't process files on-the-fly, if that's what you mean. In fact, I've tried to do that (via RCurl; lines 80-92), but had some issues, so I switched to downloading each file to a temp file and then processing it (lines 93-100). You can see all this code here: https://github.com/abnova/diss-floss/blob/master/import/getFLOSSmoleDataXML.R. – Aleksandr Blekh Feb 16 '14 at 15:46

You don't have CSV files. I only looked (yes, actually had a look in a text editor) at one of them but they seem to be tab delimited.

url <- 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectRequirements2012-Nov.txt.bz2'
file <- "temp.txt.bz2"
download.file(url, destfile = file, mode = "wb")  # "wb" keeps the binary archive intact on all platforms
dat <- bzfile(file, open = "r")
DF <- read.table(dat, header=TRUE, sep="\t")
close(dat)

head(DF)
#   proj_num proj_unixname               requirement       requirement_type      date_collected datasource_id
# 1       14          A2ps                    E-mail           Help,Support 2012-11-02 10:57:40           346
# 2       99          Acct                    E-mail           Bug Tracking 2012-11-02 10:57:40           346
# 3      128          Adns    VCS Repository Webview              Developer 2012-11-02 10:57:40           346
# 4      128          Adns                    E-mail                   Help 2012-11-02 10:57:40           346
# 5      196        AmaroK    VCS Repository Webview           Bug Tracking 2012-11-02 10:57:40           346
# 6      196        AmaroK Mailing List Info/Archive Bug Tracking,Developer 2012-11-02 10:57:40           346
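For the files that still trip scan() with "EOF within quoted string", it may also help to disable quote and comment handling (an assumption on my part, since it depends on whether those files contain stray quote characters):

DF <- read.table(dat, header = TRUE, sep = "\t", quote = "", comment.char = "", fill = TRUE)
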
Roland
  • Thank you so much! I guess the reason for my mistake is that I process CSV files from the SourceForge repository (in a different R module) and hadn't paid enough attention (lack of sleep?) to the difference. I will change and run the code, after which I'll report the results. – Aleksandr Blekh Feb 16 '14 at 16:23
  • Hmmm... It's still processing the files. It may be a bit faster, but not much - still rather slow. Another warning sign is that during processing `htop` reports CPU usage by the active R process of 30-90%, and much of the time 100%. Do you think that's normal? UPDATE: It finally finished - processing took about 30 minutes! I also see the `scan` errors for particular lines that I encountered earlier - I guess this is due to the "imperfect" structure of the files. – Aleksandr Blekh Feb 16 '14 at 16:47
  • It's also strange that, after the completion of processing, `htop` reports that R processes use 72.3% of memory (0% CPU). Seems to me like a memory leak in my R code. What do you think? – Aleksandr Blekh Feb 16 '14 at 16:55
  • In regard to the `scan` errors I have encountered, I have just found an SO question with an explanation and recommendations for potential solutions: http://stackoverflow.com/questions/18161009/error-in-reading-in-data-set-in-r. – Aleksandr Blekh Feb 17 '14 at 08:17
  • Update: the following is the run time of this module's code: user 1443.706, system 157.114, elapsed 1661.531 (seconds). – Aleksandr Blekh Feb 17 '14 at 09:42
  • If by "this module's code" you mean the code in my answer, that didn't take more than a few seconds (including the download) yesterday on my system. – Roland Feb 17 '14 at 09:44
  • Update: strangely, `scan` warnings remain, despite using `fill = TRUE` in the `read.table()` call: There were 13 warnings (use warnings() to see them) > warnings() Warning messages: 1: closing unused connection 3 (./tmp7fa64de679f1.bz2) 2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, ... : EOF within quoted string <...skipped identical messages 3-12...> 13: In scan(file, what, nmax, sep, dec, quote, skip, nlines, ... : EOF within quoted string – Aleksandr Blekh Feb 17 '14 at 09:46
  • No, by "module's code" I mean the entire code of my module `getFLOSSmoleDataXML.R`, which processes archives of multiple FLOSS repositories from the `FLOSSmole` meta-repository. This module can be found in my GitHub repository: https://github.com/abnova/diss-floss/blob/master/import/getFLOSSmoleDataXML.R. – Aleksandr Blekh Feb 17 '14 at 09:57
  • The code is currently a little messy ("working chaos"), but I plan on cleaning it up after I have most of the functionality implemented and tested. – Aleksandr Blekh Feb 17 '14 at 10:00
  • OK. Good that you are making progress, but you really don't need to ping me with regular updates. Your problem is not that interesting to me. – Roland Feb 17 '14 at 10:03
  • A very rough estimate of the total size of all files to be processed is around 100MB, where only 2-3 files comprise the majority of the volume: 1-2 files of ~29MB and 1 file of 54MB. However, I'm suspicious of three things: 1) frequently, processing a much larger file takes less time than a smaller one; 2) during processing, CPU usage hovers around 100% most of the time; 3) `scan` warnings remain despite using `fill = TRUE`, as noted above. – Aleksandr Blekh Feb 17 '14 at 10:08
  • Sorry, I meant to place most comments in a general section. – Aleksandr Blekh Feb 17 '14 at 10:12
  • 1) Among other things this depends on number of columns and column type. 2) As it should. 3) You have malformed files. What do you expect? – Roland Feb 17 '14 at 10:12