
I'm trying to read in (and eventually merge/link/manipulate) a series of large (~300M) and very large (~4G) fixed-width files for eventual regressions, visualizations, etc., and am hitting some snags.

First, the format of the files themselves is odd - I'm guessing something SQL-y. The file format is referenced here: https://msdn.microsoft.com/en-us/library/ms191479.aspx . It's fixed width, but the last column seems to (sometimes?) be cut off with an \r\n before the full fixed width for that column is reached. For reading it in I've tried LaF::laf_open_fwf and data.table::fread, but they both seem to get confused. A sample file and the associated non-XML format descriptor are here. I can't even get the thing to read in properly with that goofy last column. Here's a sample of the file:

1           1           7           7           ER
2           2           9           8           OI
3           54016       1988006     1953409     OI        
4           54017       1988014     1953415     ER        
5           54017       1988014     1953415     OB        

(but note that the CR/LF are invisible here, and it's their odd placement that is the issue. See the above link to the .txt file or png file (which I can't link, low rep) of a notepad++ view of the data to demonstrate the problem with the field.)

Second, file size is an issue. I know I have a lot of table manipulation to do, so I'm tempted to look at data.table... but I also believe data.table stores the entire object in RAM, and that's going to be problematic. LaF, ffdf, or SQLite seem like options, though I'm new to them and would need to cope with this file format issue first.
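If the parsing issue were solved, LaF's block interface looks like it would keep memory bounded. A minimal sketch of what I have in mind (the per-block step is just a placeholder):

library(LaF)
laf <- laf_open_fwf("SEARCHBASIS.txt",
                    column_widths = c(12, 12, 12, 12, 10),
                    column_names  = c("SearchBasisID", "SearchID", "PersonID", "StopID", "Basis"),
                    column_types  = rep("string", 5))
begin(laf)                                  # rewind to the start of the file
repeat {
  block <- next_block(laf, nrows = 100000)  # pull the next 100k records as a data.frame
  if (nrow(block) == 0) break               # zero-row block signals end of file
  # ... merge/aggregate each block here instead of holding the whole file ...
}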

Some questions that get at this general idea, suggesting LaF, ffbase, or data.table, are below...

Reading big data with fixed width

Quickly reading very large tables as dataframes in R

Speed up import of fixed width format table in R

... but none seems to (1) deal with this odd fixed-width-ish format or (2) move data eventually into data.tables, which is what I'd like to try first. I thought about opening and rewriting them as well-formatted CSVs so data.table could handle them (my goofy hack through data.frames and back to CSV feels ridiculous and unscalable, below). And the CSV export demonstrates how confused the file gets, since the LaF reader goes strictly by field length instead of adjusting based on where the \r\n is...

Currently I'm trying something like the below for starters. Help, if possible?

require("data.table", "LaF", "ffbase")
searchbasis.laf = laf_open_fwf("SEARCHBASIS.txt",
                               column_widths = c(12, 12, 12, 12, 10), 
                               column_names = c("SearchBasisID", "SearchID", "PersonID", "StopID", "Basis"),
                               column_types = rep("string",5),
                               trim = T)
# ^ The laf_open_fwf quietly "fails" because the last column doesn't always 
# have 10 chars, but sometimes ends short with /r/n after the element.
searchbasis.dt = as.data.table(as.data.frame(laf_to_ffdf(searchbasis.laf)))
write.csv(searchbasis.dt, file="SEARCHBASIS.csv")
# ^ To take a look at the file.  Confirms that the read from laf, transfer 
# to data.table is failing because of the last column issue.
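One direction I've been toying with (a rough sketch, untested at scale; the file names are placeholders): stream the file through once, pad every line with trailing spaces out to the full record width, and write a strictly fixed-width copy that any FWF reader should accept.

widths  <- c(12, 12, 12, 12, 10)
rec_len <- sum(widths)                      # 58, per the format descriptor
con_in  <- file("SEARCHBASIS.txt", "r")
con_out <- file("SEARCHBASIS_padded.txt", "w")
repeat {
  lines <- readLines(con_in, n = 100000)    # chunked so the 4G files never sit in RAM whole
  if (length(lines) == 0) break
  # negative width in formatC = left-justify, pad on the right with spaces
  writeLines(formatC(lines, width = -rec_len), con_out)
}
close(con_in)
close(con_out)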
Mike Dolan Fliss
  • The first referenced format description is more like a schema file than a data file. You should edit the question to include the first few rows of the actual file. (You also need to include a `library` or `require` call for the `laf_open_fwf` function.) – IRTFM Jun 27 '15 at 21:22
  • Yes, I left the requires out. I assumed people would assume. And yes, the first referenced format is a schema. I'll drop the first few lines in as an example. – Mike Dolan Fliss Jun 27 '15 at 22:21
  • No offense intended with the assumption bit. Should have included the packages, now I have. And I've edited the sample to include the problematic lines... though given the characters in question are invisible, I'm not sure the sample will help much. But, still, there it is. – Mike Dolan Fliss Jun 28 '15 at 03:18
  • Does `as.data.table(as.data.frame(laf_to_ffdf(searchbasis.laf)))` work? If so why would you immediately save that to a csv just to open it again? If it does work then a tweak would be to use setDT instead of as.data.table because the former works by reference as opposed to by copying. – Dean MacGregor Jun 28 '15 at 19:19
  • The reference idea is great, I'll think about that. I dropped it to CSV for two reasons - (1) for an easier scan of the whole file in a text editor (which confirmed that laf_open_fwf didn't work - the \r\n didn't stop the poorly fixed-width-formatted last column from being read "through" into the next record); and (2) so I could, if it wrote out properly as CSV, read it back in using data.table. – Mike Dolan Fliss Jun 28 '15 at 21:18
  • Also, this is one of the few times when I think SAS might have an easier job with this - I think its standard fixed-format reader can stop the read of the last element on an end-of-line. I just... don't want to go back to SAS. :( And I can confirm this is from SQL Server fixed-width format. If I could slog through the work of setting the database back up in SQL Server, maybe it could read it properly, but I'd really prefer to figure out how to read it directly from R, merge the tables, and do my aggregation from the original database export. – Mike Dolan Fliss Jun 28 '15 at 21:19
  • If there are no missing/empty fields, it will be trivial to convert the file to .csv using sed or awk (or whatever scripting language); for sed: `s/ +/,/g` – wildplasser Jun 28 '15 at 21:44
  • Wildplasser, I appreciate the suggestion! Wouldn't that sed just replace spaces with a comma? If a field filled its whole field length and there was no space between fields, that'd fail, yes? – Mike Dolan Fliss Jun 28 '15 at 22:59
  • @DeanMacGregor, setDT is a great idea. Question - when doing that, R seems to still leave the dataframe reference as well. Can they safely share the same reference, or should I rm() the df? And to confirm, given they now point to the same spot in memory but use different pointers, I assume I'd be safe doing so? – Mike Dolan Fliss Jun 28 '15 at 23:18
  • Sure if you don't need it remove it. Although, in the code above, you have the `as.data.frame` wrapped in `as.data.table` so you never actually have the data.frame in a variable to remove. – Dean MacGregor Jun 28 '15 at 23:36
  • True! We'll see what eventual solution will work, but getting into data table efficiently will surely be part of the solution, though, so really appreciate the edit. setDT inline by itself might enable me to work with the 4g file in memory - if I could read it in correctly, which the above code doesn't. – Mike Dolan Fliss Jun 28 '15 at 23:41

2 Answers


For this particular file:

# The format file's 7th column holds the column names; its schema rows start on line 3
form <- read.table("SEARCHBASIS_format.txt", as.is = TRUE, skip = 2)
# read.table's default whitespace splitting copes with the ragged last column
x <- read.table("SEARCHBASIS.txt", col.names = form$V7, as.is = TRUE)

If you sometimes have strings including spaces, you'll almost certainly need to process the file externally first.

If you're planning to read really large files I'd suggest (presuming you have awk on your path):

x <- setNames(data.table::fread("awk '{$1=$1}1' SEARCHBASIS.txt"), form$V7)

If you want to use fixed widths you could use:

x <- setNames(fread("gawk 'BEGIN {OFS = \"\t\"; FIELDWIDTHS = \"12 12 12 12 12\"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, \"\", $i);}}1' SEARCHBASIS.txt"), form$V7)

You can also pull the widths from the format file:

x <- setNames(fread(paste0("gawk 'BEGIN {OFS = \"\t\"; FIELDWIDTHS = \"", paste(form$V4, collapse = " "), "\"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, \"\", $i);}}1' SEARCHBASIS.txt")), form$V7)

Note `$1=$1` forces awk to re-evaluate the fields, and the `1` at the end is effectively shorthand for `print`. I've also assumed you want to strip trailing spaces from each field.
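If awk isn't to hand, a rough base-R stand-in for that whitespace-collapsing step (a sketch only; readLines pulls the whole file into memory, so this suits the smaller files):

# Squeeze runs of whitespace to single spaces, then hand the result to
# fread as in-memory text. Same caveat as the awk version: embedded
# spaces inside fields will be mangled.
lines <- readLines("SEARCHBASIS.txt")
lines <- gsub("[[:space:]]+", " ", trimws(lines))
x <- data.table::fread(paste(lines, collapse = "\n"), header = FALSE)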

On Windows you'll need to use single quotes in R and replace the single quotes within the command with " and the nested double quotes with "". So the last one above becomes:

x <- setNames(fread(paste0('gawk \"BEGIN {OFS = ""\t""; FIELDWIDTHS = ""', paste(form$V4, collapse = " "), '""} {for (i = 1; i<= NF; i++) {gsub(/ +$/, """", $i);}}1" SEARCHBASIS.txt')), form$V7)

For a cross-platform solution, you'll need to put your awk script in an external file:

stripSpace.awk

BEGIN {OFS="\t"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, "", $i);}}1

R code

x <- setNames(fread(paste0('gawk -v FIELDWIDTHS="', paste(form$V4, collapse = " "), '" -f stripSpace.awk SEARCHBASIS.txt')), form$V7)

Tested on Scientific Linux 6 and Windows 8.1
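One extra sanity check worth doing first (not part of the solution proper): confirm the R session can actually see gawk, since a failing shell command surfaces as a cryptic "File not found" error on fread's temporary file.

if (Sys.which("gawk") == "") stop("gawk not found on the PATH R sees")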

Nick Kennedy
  • Nick, thanks for the suggestion. I'll give it a shot using read.table. Not sure how it'll digest the 4gig file. I'm new to awk but heard great things; I'll investigate. Would you mind explaining what that line does? Is that standard regex? – Mike Dolan Fliss Jun 28 '15 at 22:19
  • Thanks for offering something specific. A few notes: while this does work (with `skip = 2` for the form file read, since the structure starts on line three), I feel like I'm getting lucky here, as you say, since I don't have spaces in early text records and records don't use their entire field length. Also, it demonstrates that the 35M txt file hangs in memory at around half that size, so perhaps I'd be better off trying to read these into memory directly as data.frames and cast them by reference to data.tables. – Mike Dolan Fliss Jun 28 '15 at 22:57
  • And on the awk suggestion, could you explain that in detail? There's a whole page dedicated to understanding why `awk '{$1=$1}1' filename` works. :) https://www.physicsforums.com/threads/awk-1-1-1-file-txt.658865/ – Mike Dolan Fliss Jun 28 '15 at 22:58
  • @MikeDolanFliss `fread` will be much faster than `read.table` and goes directly to a `data.table`. I've added a new version using awk which will handle fixed widths and even a version that takes the widths from the format file. – Nick Kennedy Jun 28 '15 at 23:45
  • True, I'd love to use fread. I know fread can take system commands (great use), and I've got gawk in my system path. But I'm hitting errors, maybe system specific: – Mike Dolan Fliss Jun 29 '15 at 00:17
  • @MikeDolanFliss what's the issue? – Nick Kennedy Jun 29 '15 at 00:17
  • Error in fread(paste0("gawk 'BEGIN {OFS = \"\t\"; FIELDWIDTHS = \"", paste(form$V4, : File not found: C:\Users\Mike\AppData\Local\Temp\Rtmp4aYr0r\file1918315f5d22 In addition: Warning messages: 1: running command 'C:\Windows\system32\cmd.exe /c (gawk 'BEGIN {OFS = " "; FIELDWIDTHS = "12 12 12 12 10"} {$1=$1; gsub(/ +/, "")}1' SEARCHBASIS.txt) > C:\Users\Mike\AppData\Local\Temp\Rtmp4aYr0r\file1918315f5d22' had status 1 – Mike Dolan Fliss Jun 29 '15 at 00:18
  • 2: In shell(paste("(", input, ") > ", tt, sep = "")) : '(gawk 'BEGIN {OFS = " "; FIELDWIDTHS = "12 12 12 12 10"} {$1=$1; gsub(/ +/, "")}1' SEARCHBASIS.txt) > C:\Users\Mike\AppData\Local\Temp\Rtmp4aYr0r\file1918315f5d22' execution failed with error code 1 – Mike Dolan Fliss Jun 29 '15 at 00:18
  • It's probably related to being on Windows. I'll try on a Windows machine later today. I'll suggest something that should work above in the meantime. Per http://ahahintechnology.blogspot.co.uk/2009/12/single-quotes-with-awk-in-windows.html?m=1 there might need to be even more double quotes. – Nick Kennedy Jun 29 '15 at 00:21
  • Sidenote: if you're feeling regex-strong, another method of attack would be to buffer the end of each line with the number of spaces needed pre-\r\n. I can only think of how to do that one line-length at a time... but, for instance, line.length should be 58 (sum(form$V4)). Lines of pre-\r\n length 50, 51, 52... need to be buffered by 8, 7, 6... something like replacing (?<=.+{1,57})\r\n with $1 repeat(' ', 58-strlen(submatch(1)))? I'm not fantastic with regex, but the idea is to add 58-n spaces pre-\r\n to "fix" the odd lines, so it's strictly fixed width. – Mike Dolan Fliss Jun 29 '15 at 00:28
  • There's a few options for padding lines in vim, awk, etc. here: http://stackoverflow.com/questions/9394408/vim-or-sed-awk-etc-pad-all-lines-with-spaces-to-fixed-width – Nick Kennedy Jun 29 '15 at 00:30
  • ^ Thanks for all the thoughts Nick. Though fread would be ideal, I think, for getting into data.table... The larger files might be best attacked by cleaning into true fixed width. Not sure. Odd that SQL Server seems to export like this. – Mike Dolan Fliss Jun 29 '15 at 00:32
  • @MikeDolanFliss now tested successfully using gawk on Windows 8.1 with the last version above (using extra quotes). For a cross-platform solution, you could use an awk script in a separate file. FILEWIDTHS could then be set using the -v argument. – Nick Kennedy Jun 29 '15 at 00:54
  • Amended to remove trailing spaces only (was removing all spaces) – Nick Kennedy Jun 29 '15 at 01:17
  • First: the above worked, Nick, which is amazing. Very cool. I'll pore over some of the specifics for a while and learn a lot, I'm sure. Second: unfortunately, I'm not sure this solution will work scalably. tables() shows this string-based data.table is 143 megs compared to the setDT data.table at 25 megs, with the original file at 35 megs. I can't afford to read my 4 gig file into a format that's potentially 16 gigs before I recast variables and strip spaces. :( Still: this is an awesome one-line solution for smaller datasets. – Mike Dolan Fliss Jun 29 '15 at 01:25
  • Would it be possible to use a method similar to this linked question, using awk to pad the lines with blanks, then pass into LaF or read_fwf, to handle the large files? The awk and regex syntax are stumping me. http://stackoverflow.com/questions/3121596/is-it-possible-to-pad-integers-with-zeros-using-regular-expressions – Mike Dolan Fliss Jun 29 '15 at 01:55
  • @MikeDolanFliss that's odd. I'm getting an `object.size` of 13.5 MB. Can you do `sapply(x, class)`? Are you using the updated version of what I posted which properly strips the trailing spaces from each field? The numbers should be integer columns. – Nick Kennedy Jun 29 '15 at 06:42
  • @MikeDolanFliss I've just checked, and I'm almost certain that's the issue. A previous version of my awk script wasn't actually stripping the trailing spaces from the fields and so they were all being imported as character. With the version posted now, the data.table comes in under 16 MB across Linux and Windows. – Nick Kennedy Jun 29 '15 at 10:43
  • Nick, it's been a few days, but if you're still reading this... I've tried this solution with the larger files and still seem to be stumped, but I think awk is likely the way to go. I'd like to give you the answer here... could we use the same R/awk team-up to pad the files to proper FWFs? Then I'd have lots of options to attack. – Mike Dolan Fliss Jul 04 '15 at 22:26
  • Something like: format.file.s = "SEARCHBASIS_format.txt"; file.s = "SEARCHBASIS.txt"; newfile.s = "SEARCHBASIS_ff.txt"; format.df <- read.table(format.file.s, skip = 2, as.is = TRUE); line.length = sum(format.df["V4"]) # records are meant to be 58 characters pre-\r\n; shell.exec(paste0('gawk -v LEN="', line.length, '" -f spacebuff.awk ', file.s, ' > ', newfile.s)). Then read as true FWF with LaF, base, data.table, or readr. Where spacebuff.awk is something like: {printf "%-LENs\n", $0}. This isn't working, but seems like a simple and robust approach. I'll need to watch memory on the >4g file... – Mike Dolan Fliss Jul 04 '15 at 22:26
  • Been a while - I still haven't managed to get this solution to work for me, but I think the bulk of the response is definitely of value and solves the problem for Nick, who was incredibly helpful in this thread, so I'm awarding a check. It'll likely help others. Thanks again Nick. – Mike Dolan Fliss Aug 31 '15 at 12:33

With the recent fixes, fread() can now read rows with multiple spaces without any issues (v1.9.5+ devel), via its strip.white argument (TRUE by default):

require(data.table) # v1.9.5+
fread("1           1           7           7           ER
2           2           9           8           OI
3           54016       1988006     1953409     OI        
4           54017       1988014     1953415     ER        
5           54017       1988014     1953415     OB        
")
#    V1    V2      V3      V4 V5
# 1:  1     1       7       7 ER
# 2:  2     2       9       8 OI
# 3:  3 54016 1988006 1953409 OI
# 4:  4 54017 1988014 1953415 ER
# 5:  5 54017 1988014 1953415 OB

I hope this works for your case. If not, please let us know and we'll see if we can get that done in fread(). Either upgrade to the devel version (see install instructions on our project page) or wait for the next CRAN version (v1.9.6).
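(At the time of writing, one common route to the devel version was via devtools; treat this as an assumption and defer to the project page's instructions. Windows additionally needs Rtools to compile the package.)

# Hypothetical install route for the devel version; the project page is authoritative.
install.packages("devtools")
devtools::install_github("Rdatatable/data.table")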

Arun
  • This is fantastic Arun, thanks. I'll get back to the dataset and see if this upgrade can smooth the reading. Thanks so much! – Mike Dolan Fliss Sep 18 '15 at 14:38
  • Arun: I would love, love to be able to use fread here, so thanks for the suggestion and tweak. I pulled down the development version of fread to try it out on this, but it didn't seem to work "out of the box." It did seem to work on the files that didn't have spaces within the fields (like the example above), but for files with records with spaces in the middle of fixed-width fields, I can't seem to get it to work. And I'm getting the "embedded nul in string" error partway through the read. I'm still trying to wrap my head around the awk work done above. Gah. I'll keep digging. – Mike Dolan Fliss Sep 18 '15 at 18:01
  • Thanks for writing back. Would it be possible for you to file an issue on the GitHub project page? – Arun Sep 18 '15 at 18:03
  • I'd be happy to.... but can I ask, is fread meant to / planning on reading fixed width files? I didn't think there was currently a way to do that in fread (though it would be awesome). I'll submit it anyway! Thanks again... – Mike Dolan Fliss Sep 20 '15 at 14:19