Trying to figure out how to break up a text file

Question

I have a lengthy series of text files from a project I did about 20 years ago (had to import them from floppy disks!). The original software used FORTRAN and could read the files directly, but I would like to do more efficient manipulations in R. When I read the file into R, you get something along the lines of what you would get if you created the following dataframe:

dataset <-   
as.data.frame(c("R4 8561   200 365801HARLAN     16161616116616166116",  
              "R5 8533   100 472801WHITE      11611111111111111111",  
              "R4 8573   100 485101MCKENNA    11611161161111611161",  
              "R6 8513   200 489801HOLMES     66116111611161111161",  
              "R4 8522   200 492201DAY        11111611111111116111",  
              "R6 8548   100 500901LURTON     11116111911161111111",  
              "R5 8547   100 507322HUGHES     16611111111161116611",  
              "R4 85 3   100 518001VANDEVANTER99999911111111111111",  
              "R5 8553   100 521301LAMAR      99999911111111111111",  
              1910))

This should start out as a 10 x 1 dataframe. I am pulling my hair out trying to do the following:

(1) drop the last row of the dataset, regardless of how long the dataset is. When I do something like dataset <- dataset[-nrow(dataset),] it turns the frame into a factor for some reason; then

(2) drop everything in each cell before the names. The names always begin 21 characters in;

(3) Once I have that, I would like to separate the names (which are always 11 characters long, including spaces if needed) from the numbers (which represent a series of votes);

(4) Once I have that, slice the numbers into individual cells (which will always be 1, 6, or 9). The length of the number will vary from file-to-file.

Any help is greatly appreciated.

1) Just use `data.frame` not `as.data.frame` for your sample data. — Mako212, Aug 12 '18 at 23:05
2) Your `dataset` is already a factor, `[-nrow, ]` isn't changing it. Instead first convert to `as.character` — Mako212, Aug 12 '18 at 23:07
If I may ask: why does R take the dataframe and convert it to a string if I try and remove the last row before converting to characters? Thank you again. — St4096, Aug 12 '18 at 23:27
R also has a `read.fortran` function - see here for an old example question if you know the FORTRAN input parameters - https://stackoverflow.com/questions/28490171/problems-with-importing-fortran-data-in-r/28490811 Other than that, this should be approachable as a `read.fwf` call if you specify all the widths of the columns - https://stackoverflow.com/a/41819624/496803 — thelatemail, Aug 13 '18 at 00:12

Maurits Evers · Answer 1 · 2018-08-13T02:06:17.553

I would strongly advise to use read.fwf, which allows you to read in files with a fixed column width format.

Consider your data consisting of the following lines

lines <-
"R4 8561   200 365801HARLAN     16161616116616166116s
R5 8533   100 472801WHITE      11611111111111111111s
R4 8573   100 485101MCKENNA    11611161161111611161s
R6 8513   200 489801HOLMES     66116111611161111161s
R4 8522   200 492201DAY        11111611111111116111s
R6 8548   100 500901LURTON     11116111911161111111s
R5 8547   100 507322HUGHES     16611111111161116611s
R4 85 3   100 518001VANDEVANTER99999911111111111111s
R5 8553   100 521301LAMAR      99999911111111111111s
1910"

Then we read in the data with a fixed column width format as specified in function argument width (the first column has width 20, the second width 11, and so on).

df <- read.fwf(textConnection(lines), width = c(20, 11, 1, 6, 9))

# Remove first column and last line
df[-nrow(df), -1]
#           V2 V3     V4        V5
#1 HARLAN       1 616161 611661616
#2 WHITE        1 161111 111111111
#3 MCKENNA      1 161116 116111161
#4 HOLMES       6 611611 161116111
#5 DAY          1 111161 111111111
#6 LURTON       1 111611 191116111
#7 HUGHES       1 661111 111116111
#8 VANDEVANTER  9 999991 111111111
#9 LAMAR        9 999991 111111111

Some further comments:

read.fwf takes a file as its first argument, e.g. you should use read.fwf(filename, width = ...).
You may want to trim leading/trailing whitespace with trimws.

Mako212 · Answer 2 · 2018-08-12T23:18:13.683

dataset <-    data.frame(
              test = c("R4 8561   200 365801HARLAN     16161616116616166116",  
              "R5 8533   100 472801WHITE      11611111111111111111",  
              "R4 8573   100 485101MCKENNA    11611161161111611161",  
              "R6 8513   200 489801HOLMES     66116111611161111161",  
              "R4 8522   200 492201DAY        11111611111111116111",  
              "R6 8548   100 500901LURTON     11116111911161111111",  
              "R5 8547   100 507322HUGHES     16611111111161116611",  
              "R4 85 3   100 518001VANDEVANTER99999911111111111111",  
              "R5 8553   100 521301LAMAR      99999911111111111111",  
              1910))

dataset <- dataset[-nrow(dataset), ]

Convert to character:

dataset$test <- as.character(dataset$test)

Strip first 20 characters:

dataset$new <- substr(dataset$test,21,100000)

Get names:

dataset$names <- gsub("^([A-Z]+).*", "\\1", dataset$new)

Get numbers:

dataset$numbers <- gsub(".*?(\\d+)$", "\\1", dataset$new)

There should be an answered question on how to split the numbers into columns, see here

dataset
                                                  test                             new       names              numbers
1  R4 8561   200 365801HARLAN     16161616116616166116 HARLAN     16161616116616166116      HARLAN 16161616116616166116
2  R5 8533   100 472801WHITE      11611111111111111111 WHITE      11611111111111111111       WHITE 11611111111111111111
3  R4 8573   100 485101MCKENNA    11611161161111611161 MCKENNA    11611161161111611161     MCKENNA 11611161161111611161
4  R6 8513   200 489801HOLMES     66116111611161111161 HOLMES     66116111611161111161      HOLMES 66116111611161111161
5  R4 8522   200 492201DAY        11111611111111116111 DAY        11111611111111116111         DAY 11111611111111116111
6  R6 8548   100 500901LURTON     11116111911161111111 LURTON     11116111911161111111      LURTON 11116111911161111111
7  R5 8547   100 507322HUGHES     16611111111161116611 HUGHES     16611111111161116611      HUGHES 16611111111161116611
8  R4 85 3   100 518001VANDEVANTER99999911111111111111 VANDEVANTER99999911111111111111 VANDEVANTER 99999911111111111111
9  R5 8553   100 521301LAMAR      99999911111111111111 LAMAR      99999911111111111111       LAMAR 99999911111111111111

score 0 · Answer 3 · answered Aug 13 '18 at 00:16

Here is one of the approach : use of text functions: nchar - number of characters in string, substr- to split string with start and end character, and strsplit - to split figure into separate columns :

    dataset <- as.data.frame(dataset[-nrow(dataset), ],  stringsAsFactors=F)
    colnames(dataset) <- "text"
    drop20 <- function(x){substr(x, 21, nchar(x))}
    dataset <- as.data.frame(sapply(dataset, drop20), stringsAsFactors=F)
    colnames(dataset) <- "text"

    cellnamesplit <- function(x){substr(x, 1, 11)}
    cellvaluesplit <- function(x){substr(x, 12, nchar(x))}
    cellname <- as.data.frame(sapply(dataset, cellnamesplit), stringsAsFactors=F)
    cellvalue <- as.data.frame(sapply(dataset, cellvaluesplit), stringsAsFactors=F)

    splitvalues <- function(x){strsplit(x, "")}
    valcellall <- t(as.data.frame(apply(cellvalue, 1, splitvalues), stringsAsFactors=F))

    final <- cbind(cellname, valcellall)

Trying to figure out how to break up a text file

3 Answers3