
I have a text file of around 20 pages with around 200 paragraphs. Each paragraph contains three lines describing information about a person like so:

Name: John
Age: 26
Phone number: 123421

Name: Mary
Age: 80
Phone number: NA

...

Now I wish to convert this large file into a dataframe where the columns represent the three variables Name, Age and Phone number and where the rows correspond to the persons.

Name      Age      Phone number
John      26       123421
Mary      80       NA
...       ...      ...

How can I convert the large text file into such a dataframe?

WBM

2 Answers


Not pretty, but here is a regex option that may work, depending on how the data is read in:

test<-
"Name: John
Age: 26
Phone number: 123421

Name: Mary
Age: 80
Phone number: NA
"

This is read in as:

[1] "Name: John\nAge: 26\nPhone number: 123421\n\nName: Mary\nAge: 80\nPhone number: NA\n"

Now use regex to get all matches. The character classes include letters so that "NA" entries are matched too, which ensures every column ends up with the same number of rows:

Names<-regmatches(test, gregexpr("(?<=Name: )[a-zA-Z]+", test, perl=TRUE))

Numbers<-regmatches(test, gregexpr("(?<=Phone number: )[a-zA-Z0-9]+", test, perl=TRUE))

Age<-regmatches(test, gregexpr("(?<=Age: )[a-zA-Z0-9]+", test, perl=TRUE))

df<-data.frame(Names,Numbers,Age)
names(df)<-c("Name","Number","Age")

> df
  Name Number Age
1 John 123421  26
2 Mary     NA  80
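The same idea can be wrapped in a small helper so that each column is a plain character vector and the missing values become real NA values. This is only a sketch: `extract_field` is a hypothetical helper name, not part of base R, and it assumes every record contains all three fields.

```r
test <- "Name: John\nAge: 26\nPhone number: 123421\n\nName: Mary\nAge: 80\nPhone number: NA\n"

# Hypothetical helper: return every value that follows "<field>: " in txt.
extract_field <- function(field, txt) {
  pattern <- paste0("(?<=", field, ": )[a-zA-Z0-9]+")
  # regmatches() returns a list with one element per input string;
  # [[1]] pulls out the character vector of matches.
  regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
}

df <- data.frame(
  Name   = extract_field("Name", test),
  Age    = extract_field("Age", test),
  Number = extract_field("Phone number", test),
  stringsAsFactors = FALSE
)

# Turn the literal string "NA" into a real missing value and make Age an integer.
df$Number[df$Number == "NA"] <- NA
df$Age <- as.integer(df$Age)
```

Because the regex returns the text "NA" rather than a missing value, the explicit conversion step matters if later code relies on is.na().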

Here is how to format the data for this approach if it is read in using read.csv:

test<-read.csv(text=test, header=FALSE, stringsAsFactors=FALSE)
test<-list(test$V1)
test<-paste(unlist(test), collapse =" ")
>test
[1] "Name: John Age: 26 Phone number: 123421 Name: Mary Age: 80 Phone number: NA"
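If the records live in a file on disk, readLines() is another way to get a single string for the regex step. The sketch below writes the sample records to a temporary file so it runs on its own; `path` is a placeholder for the real file location.

```r
# Write the sample records to a temporary file so the sketch is self-contained;
# in practice `path` would point at the real text file.
path <- tempfile(fileext = ".txt")
writeLines(c("Name: John", "Age: 26", "Phone number: 123421", "",
             "Name: Mary", "Age: 80", "Phone number: NA"), path)

# readLines() returns one element per line; collapsing with "\n" rebuilds
# a single string with newlines between the lines.
test <- paste(readLines(path), collapse = "\n")
```

Collapsing with "\n" keeps the blank line between records, so the result has the same shape as the string defined at the top of this answer, apart from the trailing newline.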

If the names include last names, the regex for Names needs to change as well, so that it captures everything up to the Age field:

(?<=Name: ).+?(?=Age)
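A runnable sketch of the multi-word case follows. It uses a slight variant of the pattern, with `\\s+` inside the lookahead, so that the whitespace or newline before "Age:" is not captured and the match also works when the fields are separated by newlines. The sample names are made up for illustration.

```r
# Made-up sample data with multi-word names.
test <- "Name: John Smith\nAge: 26\nPhone number: 123421\n\nName: Mary Ann Jones\nAge: 80\nPhone number: NA\n"

# Lazily match everything after "Name: " up to the whitespace that precedes "Age:".
Names <- regmatches(
  test,
  gregexpr("(?<=Name: ).+?(?=\\s+Age:)", test, perl = TRUE)
)[[1]]
```

The lookahead stops at the separator, so Names holds the full names without trailing whitespace.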

Chabo

The read.dcf() function from base R was built to read this type of data:

read.dcf(textConnection(test), all = TRUE)
  Name Age Phone number
1 John  26       123421
2 Mary  80           NA

A brief description of the DCF ("Debian Control File") format can be found in help("read.dcf").
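read.dcf() also accepts a file path, so the text file from the question can be read directly. The sketch below assumes the file has exactly the layout shown in the question (one "Field: value" pair per line, records separated by a blank line); `path` is a placeholder and the type conversions at the end are an optional extra.

```r
# Write the sample records to a temporary file so the sketch is self-contained;
# in practice `path` would point at the real text file.
path <- tempfile(fileext = ".txt")
writeLines(c("Name: John", "Age: 26", "Phone number: 123421", "",
             "Name: Mary", "Age: 80", "Phone number: NA"), path)

# all = TRUE returns a data frame with one row per record.
df <- read.dcf(path, all = TRUE)

# All columns come back as character, so convert where needed.
df$Age <- as.integer(df$Age)
df[["Phone number"]][df[["Phone number"]] == "NA"] <- NA
```

Since the column name "Phone number" contains a space, it has to be accessed with [[ ]] or backticks rather than a bare `$` name.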

Data

test <-
"Name: John
Age: 26
Phone number: 123421

Name: Mary
Age: 80
Phone number: NA
"
Uwe
  • This is an awesome package I have not heard of until now, this should be considered for the accepted answer. – Chabo Feb 15 '19 at 16:54
  • Thanks, @Chabo. The function is part of base R, so no packages are required. – Uwe Feb 15 '19 at 16:56