How to adjust separator to be multiple spaces rather than single space to read dataframe into r?

Question

I need to read the following table into R. The table is the printed output of prop.table(table(x,y),2). The problem is the separator: sep="\t" does not work, and I also tried sep="" and sep=c("\t",""), but none of them works because some row labels contain spaces as well.

I tried the following code:

dataMonthlyTrend<-read.table(text= " Variable Non_stopped_percentage stopped_percentage
INDIV                                  1.0000000       0.0000000000
INDUSTRY                               0.9987045       0.0012955466
NETWORK                                0.9990512       0.0009487666
OTHER                                  0.9997127       0.0002872679
Early Phase 1                          0.9981618       0.0018382353
Not.Applicable                         0.9996305       0.0003694809
Phase.1                                0.9993608       0.0006392499
Phase 1, Phase 2                       1.0000000       0.0000000000
Phase.2                                0.9990993       0.0009006980
Phase.2, Phase 3                       0.9990329       0.0009671180
Phase 3                                0.9991403       0.0008596974
Phase 4                                0.9995730       0.0004269855
Observational                          0.9997154       0.0002846084
Expanded Access                        1.0000000       0.0000000000
Interventional                         0.9994374       0.0005625766
Behavioral                             0.9998005       0.0001994813
Biological                             0.9995493       0.0004506534
Combination Product                    1.0000000       0.0000000000
Device                                 0.9991869       0.0008131403
Diagnostic Test                        1.0000000       0.0000000000
More than 1 type                       0.9992554       0.0007446016
Other                                  0.9996144       0.0003855546
Procedure                              1.0000000       0.0000000000
Radiation                              1.0000000       0.0000000000
Case-Control                           0.9996120       0.0003880481
Case-Crossover                         1.0000000       0.0000000000
Case-Only                              0.9996069       0.0003930818
Cohort                                 0.9996924       0.0003075977
Defined Population                     1.0000000       0.0000000000
Ecologic or Community                  1.0000000       0.0000000000
Family-Based                           1.0000000       0.0000000000
Natural History                        1.0000000       0.0000000000
Other                                  1.0000000       0.0000000000
Non-Probability Sample                 0.9997578       0.0002422481
Africa                                 1.0000000       0.0000000000
Asia                                   0.9998925       0.0001075038
Europe                                 0.9998773       0.0001227220
More than 1 continent                  0.9998412       0.0001587554
North America                          0.9994576       0.0005423974
Oceania                                0.9969970       0.0030030030
South America                          1.0000000       0.0000000000
", sep="\t", header=T);dataMonthlyTrend

This is what I am getting using the prior code

Then I used the code shown in the screenshot below, and there were no column headings, although I entered them as per @G. Grothendieck's code.

My current R version is 3.6.2 (2019-12-12).

Any advice will be greatly appreciated.

G. Grothendieck · Answer 1 · 2021-01-21T17:55:07.870

1) read.table

Assuming the input is in Lines, as shown in the Note at the end, replace each run of two or more whitespace characters with a semicolon, giving L. Then read L, skipping the header line, and read the header separately to supply col.names.

L <- gsub("\\s{2,}", ";", Lines)
DF <- read.table(text = L, sep = ";", skip = 1, strip.white = TRUE,
  col.names = read.table(text = L, nrows = 1))

The result looks like this:

> str(DF)
'data.frame':   41 obs. of  3 variables:
 $ Variable              : chr  "INDIV" "INDUSTRY" "NETWORK" "OTHER" ...
 $ Non_stopped_percentage: num  1 0.999 0.999 1 0.998 ...
 $ stopped_percentage    : num  0 0.001296 0.000949 0.000287 0.001838 ...
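To see why the substitution in approach 1 keeps labels such as "Early Phase 1" intact, it can help to apply it to single lines. The following is a small illustration using two lines copied from the question; the variable names are only for this sketch.

```r
data_line   <- "Early Phase 1                          0.9981618       0.0018382353"
header_line <- " Variable Non_stopped_percentage stopped_percentage"

# Runs of two or more whitespace characters become ";",
# single spaces inside the label are left untouched.
data_out <- gsub("\\s{2,}", ";", data_line)
print(data_out)

# The header contains only single spaces, so it passes through unchanged.
# That is why it is read separately with the default whitespace separator.
header_out <- gsub("\\s{2,}", ";", header_line)
print(header_out)
```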

2) read.pattern

Another approach is to use read.pattern in the gsubfn package. This can do it all at once instead of needing a separate step to replace the separators.

library(gsubfn)
DF <- read.pattern(text = Lines, pattern = "^(.*)\\s+(\\S+)\\s+(\\S+)$",
  skip = 1, col.names = read.table(text = Lines, nrows = 1))
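If installing gsubfn is not an option, a similar one-pass parse can be sketched in base R with utils::strcapture (available since R 3.4.0). This is a swapped-in alternative, not part of the answer above. It assumes the header is handled separately, that every data line ends with exactly two numeric fields, and it uses a lazy first group so that the padding is not captured with the label.

```r
# Two sample data lines from the question (no header line here)
x <- c(
  "INDIV                                  1.0000000       0.0000000000",
  "Early Phase 1                          0.9981618       0.0018382353"
)

# proto supplies the column names and types for the three capture groups
proto <- data.frame(
  Variable = character(),
  Non_stopped_percentage = numeric(),
  stopped_percentage = numeric(),
  stringsAsFactors = FALSE
)

# (.*?) is lazy, so the label stops before the run of padding spaces;
# perl = TRUE selects PCRE syntax for the pattern.
DF2 <- strcapture("^(.*?)\\s+(\\S+)\\s+(\\S+)$", x, proto, perl = TRUE)
str(DF2)
```

Lines that do not match the pattern come back as rows of NA, so checking anyNA(DF2) afterwards is a cheap way to spot malformed input.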

Note

Lines <- " Variable Non_stopped_percentage stopped_percentage
INDIV                                  1.0000000       0.0000000000
INDUSTRY                               0.9987045       0.0012955466
NETWORK                                0.9990512       0.0009487666
OTHER                                  0.9997127       0.0002872679
Early Phase 1                          0.9981618       0.0018382353
Not.Applicable                         0.9996305       0.0003694809
Phase.1                                0.9993608       0.0006392499
Phase 1, Phase 2                       1.0000000       0.0000000000
Phase.2                                0.9990993       0.0009006980
Phase.2, Phase 3                       0.9990329       0.0009671180
Phase 3                                0.9991403       0.0008596974
Phase 4                                0.9995730       0.0004269855
Observational                          0.9997154       0.0002846084
Expanded Access                        1.0000000       0.0000000000
Interventional                         0.9994374       0.0005625766
Behavioral                             0.9998005       0.0001994813
Biological                             0.9995493       0.0004506534
Combination Product                    1.0000000       0.0000000000
Device                                 0.9991869       0.0008131403
Diagnostic Test                        1.0000000       0.0000000000
More than 1 type                       0.9992554       0.0007446016
Other                                  0.9996144       0.0003855546
Procedure                              1.0000000       0.0000000000
Radiation                              1.0000000       0.0000000000
Case-Control                           0.9996120       0.0003880481
Case-Crossover                         1.0000000       0.0000000000
Case-Only                              0.9996069       0.0003930818
Cohort                                 0.9996924       0.0003075977
Defined Population                     1.0000000       0.0000000000
Ecologic or Community                  1.0000000       0.0000000000
Family-Based                           1.0000000       0.0000000000
Natural History                        1.0000000       0.0000000000
Other                                  1.0000000       0.0000000000
Non-Probability Sample                 0.9997578       0.0002422481
Africa                                 1.0000000       0.0000000000
Asia                                   0.9998925       0.0001075038
Europe                                 0.9998773       0.0001227220
More than 1 continent                  0.9998412       0.0001587554
North America                          0.9994576       0.0005423974
Oceania                                0.9969970       0.0030030030
South America                          1.0000000       0.0000000000
"

Thx for your precious input. I used your code exactly as you said but it is missing headers. Please let me know if I should edit something in that. Upvoted. — Mohamed Rahouma, Jan 21 '21 at 17:47
The code shown reads the headers as well, as you can see from the output. — G. Grothendieck, Jan 21 '21 at 17:49
Thx. I see, but I am not sure why the headers still do not show, even with your 2nd approach. Here is the output of str(DF): 'data.frame': 41 obs. of 3 variables: $ X1 : Factor w/ 40 levels "Africa","Asia",..: 18 19 24 31 13 27 35 32 36 37 ... $ X1.1: num 1 0.999 0.999 1 0.998 ... $ X1.2: num 0 0.001296 0.000949 0.000287 0.001838 ... — Mohamed Rahouma, Jan 21 '21 at 17:53
Both approaches work with `Lines`. The second approach is not due to any deficiency in the first approach. It is just another alternative. Presumably there is some difference between your data and what you showed in the question. Please show the first few lines to better understand the difference. — G. Grothendieck, Jan 21 '21 at 18:00
Also, are you using an old version of R? read.table does not produce factor output with R 4.0+. — G. Grothendieck, Jan 21 '21 at 18:02
I can't test that since it is an image. I had updated the code a few times. Although I would not think it would make a difference, try the latest version in the answer just in case. Also update your version of R to the latest version, and check whether there are any weird unprintable characters in your input. See https://stackoverflow.com/questions/34613761/detect-non-ascii-characters-in-a-string — G. Grothendieck, Jan 21 '21 at 19:03
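One possible explanation for the X1, X1.1, X1.2 names reported in the comments, offered as an assumption rather than something confirmed in the thread: before R 4.0, read.table defaults to stringsAsFactors = TRUE, so the inner read.table call that fetches the header returns factor columns, and converting those to character yields the factor codes instead of the labels. The sketch below forces the old default to reproduce the effect on a shortened copy of the data, then reads the header as plain strings.

```r
Lines <- " Variable Non_stopped_percentage stopped_percentage
INDIV                                  1.0000000       0.0000000000
Early Phase 1                          0.9981618       0.0018382353"

L <- gsub("\\s{2,}", ";", Lines)

# Simulate the pre-4.0 default: the header row comes back as factors
hdr_factor <- read.table(text = L, nrows = 1, stringsAsFactors = TRUE)
codes <- as.character(hdr_factor)             # factor codes, not the labels
bad_names <- make.names(codes, unique = TRUE)
print(bad_names)

# Reading the header as character avoids the problem on any R version
hdr <- read.table(text = L, nrows = 1, stringsAsFactors = FALSE)
DF <- read.table(text = L, sep = ";", skip = 1, strip.white = TRUE,
  col.names = as.character(hdr), stringsAsFactors = FALSE)
str(DF)
```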

score 0 · Answer 2 · answered Jan 21 '21 at 17:44

The read.table function (and other similar functions) only allows sep to be a literal character to match on, not a regular expression, which is what you need here.

So, one approach is to put your data into a character string, preprocess it to change the delimiter, and then pass the result to read.table (if reading from a file, you can use readLines to read it into a character vector for preprocessing).
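For the file case mentioned in parentheses, a minimal sketch might look like the following. The temporary file is hypothetical and only stands in for the real input; the colon delimiter assumes, as in this answer, that no colon occurs in the data.

```r
# Hypothetical input file standing in for the real one
path <- tempfile(fileext = ".txt")
writeLines(c(
  " Variable Non_stopped_percentage stopped_percentage",
  "INDIV                                  1.0000000       0.0000000000",
  "South America                          1.0000000       0.0000000000"
), path)

# Read the raw lines, then turn runs of whitespace (or tabs) into ":"
raw_lines <- readLines(path)
clean_lines <- gsub("[[:space:]]{2,}|\\t", ":", raw_lines)

# Skip the header line, which uses single spaces and is handled separately
DF_file <- read.table(text = clean_lines, sep = ":", skip = 1,
  stringsAsFactors = FALSE)
print(DF_file)
```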

Here is some code that does most of what you describe (I shortened the data):

text.data <- " Variable Non_stopped_percentage stopped_percentage
INDIV                                  1.0000000       0.0000000000
INDUSTRY                               0.9987045       0.0012955466
South America                          1.0000000       0.0000000000"

text.data2 <- gsub("[[:space:]]{2,}|\\t", ":", text.data)
read.table(text=text.data2, sep=':', skip=1)

Here I store the data in text.data, then use gsub to replace every run of 2 or more whitespace characters, or a single tab character, with a colon (another separator would work too, but the colon is a character that does not already occur in the data). The processed data is then passed to read.table with sep=':'. I skipped the first line because the column names in it are separated by a single space (and preceded by an extra space), so you will need to set the names in an additional step.
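The additional step for the names could be sketched as follows, reusing the shortened data from this answer. Splitting the first line on whitespace is an assumption that holds here because the column names themselves contain no spaces.

```r
text.data <- " Variable Non_stopped_percentage stopped_percentage
INDIV                                  1.0000000       0.0000000000
INDUSTRY                               0.9987045       0.0012955466
South America                          1.0000000       0.0000000000"

text.data2 <- gsub("[[:space:]]{2,}|\\t", ":", text.data)
DF <- read.table(text = text.data2, sep = ":", skip = 1,
  stringsAsFactors = FALSE)

# Take the first line, drop the leading space, split on whitespace
first_line <- strsplit(text.data, "\n", fixed = TRUE)[[1]][1]
header <- strsplit(trimws(first_line), "[[:space:]]+")[[1]]

# Only assign the names if the counts agree
stopifnot(length(header) == ncol(DF))
names(DF) <- header
print(DF)
```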

However, since this is the output from prop.table it is probably better to just save the results of prop.table rather than printing the results and copying and pasting them to additional code.
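As a rough sketch of that suggestion, the proportions can be turned into a data frame directly, without going through printed text. The vectors x and y below are made-up stand-ins for the asker's variables, and as.data.frame.matrix is one common way to keep the two-column layout of the table.

```r
# Hypothetical stand-ins for the asker's x and y
x <- c("INDIV", "INDIV", "INDUSTRY", "INDUSTRY", "INDUSTRY", "NETWORK")
y <- c("Non_stopped", "stopped", "Non_stopped", "Non_stopped", "stopped",
  "Non_stopped")

# Column-wise proportions, as in the question
tab <- prop.table(table(x, y), 2)

# Keep the wide layout: one row per level of x, one column per level of y
wide <- as.data.frame.matrix(tab)
DF_tab <- data.frame(Variable = rownames(wide), wide, row.names = NULL,
  stringsAsFactors = FALSE)
print(DF_tab)
```

From here the result can be stored with saveRDS or write.csv and read back later, which avoids parsing the console output altogether.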
