I copied the content of a webpage into a .txt file and would like to read it into R properly. The data has 9 columns and looks like this:

  5     11111  A, B                       z   L  2800   +25   11  2000.04.13         
  4      2222  C, D                       z   M  2730   -25   30  2001.05.23         
 66       333  E, F                       z   N  2680   +45   23  2002.12.14         
  7     44444  G, H                       z   O  2665    +5   21  2003.03.18         
111         5  I, J                       z   P  2645    +5   38  2004.02.22 

Each row starts with leading blank space (before the first column) whose length depends on whether the number in the first column has one, two, or more digits. The letters A, B, C, ... in the third column stand for first and last names of varying length, separated by a comma and exactly one space (i.e. "A, B" is the full name of the first person). The amount of whitespace between columns differs from row to row.

Does anybody have an idea how I can read this text into a dataframe with columns correctly specified?

Thank you!

Sohrab
  • Have you tried `fread("yourtextfile.txt", sep=" ")` from the _data.table_ package? – pacomet Feb 22 '19 at 14:55
  • Possible duplicate of [What can R do about a messy data format?](https://stackoverflow.com/questions/52023709/what-can-r-do-about-a-messy-data-format) – NelsonGon Feb 22 '19 at 14:58

1 Answer

Try the code below. First read the data with fread; because of the space after the comma, each name is split across columns V3 and V4, so unite those two columns afterwards if needed.

library(data.table)
# read with a single space as separator; runs of spaces are collapsed
data <- fread("dat.txt", sep = " ")

head(data)
    V1    V2 V3 V4 V5 V6   V7  V8 V9        V10
1:   5 11111 A,  B  z  L 2800  25 11 2000.04.13
2:   4  2222 C,  D  z  M 2730 -25 30 2001.05.23
3:  66   333 E,  F  z  N 2680  45 23 2002.12.14
4:   7 44444 G,  H  z  O 2665   5 21 2003.03.18
5: 111     5 I,  J  z  P 2645   5 38 2004.02.22

library(tidyverse)
# merge V3 and V4 back into a single name column
data2 <- unite(data, "newcol", V3, V4, sep = "")

head(data2)
    V1    V2 newcol V5 V6   V7  V8 V9        V10
1:   5 11111    A,B  z  L 2800  25 11 2000.04.13
2:   4  2222    C,D  z  M 2730 -25 30 2001.05.23
3:  66   333    E,F  z  N 2680  45 23 2002.12.14
4:   7 44444    G,H  z  O 2665   5 21 2003.03.18
5: 111     5    I,J  z  P 2645   5 38 2004.02.22
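
If the last column should be a real date rather than text, it can be converted after reading. This is a minimal sketch, not part of the code above; it assumes fread left V10 as character, and the vector `v10` simply copies the values shown in the output so that the snippet runs on its own.

```r
# Values copied from the V10 column in the output above.
v10 <- c("2000.04.13", "2001.05.23", "2002.12.14", "2003.03.18", "2004.02.22")

# The format string describes year.month.day separated by dots.
dates <- as.Date(v10, format = "%Y.%m.%d")

print(dates)

# On the data.table itself the same conversion could be written as
# (assumption: V10 is still a character column):
# data2[, V10 := as.Date(V10, format = "%Y.%m.%d")]
```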
pacomet
  • Thank you for your answer. I have tried fread (with sep=" " and fill = T) but it does not work; I just realized that in some rows there is an extra column at the end. Moreover, some people have a middle name separated with comma and space (e.g. A, B, C). I get the following error: " Expecting 11 cols, but line 28 contains text after processing all cols..." – Sohrab Feb 22 '19 at 15:53
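
For rows like the ones described in the last comment (names with a varying number of parts, and an occasional extra column at the end), one possible approach is to avoid delimiter-based readers and instead match each line against a regular expression anchored at both ends. The following is a minimal base R sketch under stated assumptions, not a tested solution for the real file: it assumes columns 4 and 5 are always single tokens without spaces, that the date always has the form yyyy.mm.dd, and that name parts never look like the columns that follow them. The sample `lines`, the column names `V1` to `V9` and `extra` are made up for illustration.

```r
# Hypothetical sample lines modelled on the data in the question; the third
# row has a middle name and the fourth has an extra trailing column.
lines <- c(
  "  5     11111  A, B                       z   L  2800   +25   11  2000.04.13         ",
  "  4      2222  C, D                       z   M  2730   -25   30  2001.05.23",
  " 66       333  E, F, X                    z   N  2680   +45   23  2002.12.14",
  "111         5  I, J                       z   P  2645    +5   38  2004.02.22  extra"
)
# With the real file this would be: lines <- readLines("dat.txt")

pattern <- paste0(
  "^\\s*(\\d+)\\s+(\\d+)\\s+",               # columns 1-2: integers
  "(.*?)\\s+",                                # column 3: full name, may contain ", "
  "(\\S+)\\s+(\\S+)\\s+",                    # columns 4-5: single tokens (assumed)
  "(\\d+)\\s+([+-]?\\d+)\\s+(\\d+)\\s+",     # columns 6-8: integers, 7 is signed
  "(\\d{4}\\.\\d{2}\\.\\d{2})",              # column 9: date as yyyy.mm.dd
  "\\s*(.*?)\\s*$"                            # anything left over (may be empty)
)

m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))

# A successful match has 11 elements: the full match plus 10 capture groups.
ok <- lengths(m) == 11L
parts <- do.call(rbind, m[ok])

df <- data.frame(
  V1 = as.integer(parts[, 2]),
  V2 = as.integer(parts[, 3]),
  V3 = parts[, 4],
  V4 = parts[, 5],
  V5 = parts[, 6],
  V6 = as.integer(parts[, 7]),
  V7 = as.integer(parts[, 8]),
  V8 = as.integer(parts[, 9]),
  V9 = parts[, 10],
  extra = parts[, 11],
  stringsAsFactors = FALSE
)

print(df)

# Lines that did not fit the pattern can be inspected separately.
unmatched <- lines[!ok]
```

Matching from both ends means the name column absorbs whatever lies between the second number and the two single-token columns, so "A, B" and "A, B, C" are handled the same way. Rows listed in `unmatched` would need a closer look or a looser pattern.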