I want to read a file as described at

http://snap.stanford.edu/data/wiki-RfA.html

into a data frame in R.

I know the function `read.table`, but I think it only works with a vertical table, where each record is one row.

How should I read a file like the one above?

The file format is:

SRC:Guettarda
TGT:Lord Roem
VOT:1
RES:1
YEA:2013
DAT:19:53, 25 January 2013
TXT:'''Support''' per [[WP:DEAL]]: clueful, and unlikely to break Wikipedia.

So I want to read the file into a data frame with 7 columns: SRC, TGT, ..., TXT.

mommomonthewind
  • You can try `readLines` to read each line and then parse it out and use `rbind` and `cbind` to create a data frame. – Gopala Apr 07 '16 at 13:00
  • Thanks @Gopala, but is there a faster and more elegant way to do that? – mommomonthewind Apr 07 '16 at 13:06
  • Unless there is a package written for this specific file format, I don't see how a general purpose thing can help. You can look at the `readr` package for some things, but I don't imagine it will solve this specific problem. – Gopala Apr 07 '16 at 13:07
  • Check this out http://stackoverflow.com/questions/21891841/importing-only-every-nth-row-from-a-csv-file-in-r – chinsoon12 Apr 07 '16 at 13:08
  • Hi @chinsoon12, thanks for your answer, but I could not see the link between the two questions. – mommomonthewind Apr 07 '16 at 13:16
  • Hi @Gopala, I thought this file format was just a transposed version of the CSV format, isn't it? Maybe I am wrong? – mommomonthewind Apr 07 '16 at 13:20
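The `readLines`-and-parse idea from the comments can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: records are separated by blank lines, each line is `KEY:value`, and only the first colon is a delimiter. The helper name `parse_records` and the second sample record are hypothetical, not part of the dataset.

```r
# Two sample records in the question's format (the second one is made up).
sample_text <- c(
  "SRC:Guettarda",
  "TGT:Lord Roem",
  "VOT:1",
  "RES:1",
  "YEA:2013",
  "DAT:19:53, 25 January 2013",
  "TXT:'''Support''' per [[WP:DEAL]]: clueful, and unlikely to break Wikipedia.",
  "",
  "SRC:ExampleUser",
  "TGT:Lord Roem",
  "VOT:-1",
  "RES:1",
  "YEA:2013",
  "DAT:20:10, 25 January 2013",
  "TXT:'''Oppose''' for now.",
  ""
)

# Hypothetical helper: turn KEY:value lines into one row per record.
parse_records <- function(lines) {
  # Count the blank lines seen so far; lines between two blanks share a count.
  blank_count <- cumsum(lines == "")
  keep <- lines != ""
  lines <- lines[keep]
  # Renumber the groups densely (1, 2, 3, ...) in case of repeated blank lines.
  record_id <- match(blank_count[keep], unique(blank_count[keep]))

  # Split at the FIRST colon only; later colons belong to the value.
  key   <- sub(":.*$", "", lines)
  value <- sub("^[^:]*:", "", lines)

  # Fill a character matrix by (record, field) position, then convert.
  fields <- unique(key)
  out <- matrix(NA_character_, nrow = max(record_id), ncol = length(fields),
                dimnames = list(NULL, fields))
  out[cbind(record_id, match(key, fields))] <- value
  as.data.frame(out, stringsAsFactors = FALSE)
}

df <- parse_records(sample_text)
```

For the real file, something like `parse_records(readLines(gzfile("wiki-RfA.txt.gz")))` should work in the same way, assuming the file has been downloaded to the working directory. All columns come back as character vectors.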

2 Answers

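Here is a method using `readLines`:

dataStartPosn <- 5    # values start after the 3-letter key and the colon
nfields <- 7          # fields per record; an 8th, blank line separates records
TXTmaxLen <- 1e3      # values longer than this many characters are truncated
eachColnameLen <- 3

# download and read lines (readLines opens and closes the connection itself)
temp <- tempfile()
download.file("http://snap.stanford.edu/data/wiki-RfA.txt.gz", temp)
dataLines <- readLines(gzfile(temp))

library(stringi)

# extract data: everything after the "KEY:" prefix of each line
data <- stri_sub(dataLines, dataStartPosn, length=TXTmaxLen)

# extract colnames from the first record (the 8th name is empty)
colnames <- unname(sapply(dataLines[1:(nfields+1)], function(x) substring(x, 1, eachColnameLen)))

# form table: every block of 8 consecutive lines becomes one row
df <- data.frame(do.call(rbind, split(data, ceiling(seq_along(data)/(nfields+1)))))

# formatting: name the columns and drop the empty separator column
df <- setNames(df, colnames)
df <- df[-(nfields+1)]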
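The reshaping step above relies on every record occupying exactly eight lines (seven fields plus a blank separator). A minimal sketch of the same fixed-stride idea on in-memory data, using `matrix(..., byrow = TRUE)` in place of `split` and `rbind`, might look like this. The sample lines and the name `votes` are illustrative assumptions, and the second record is made up.

```r
nfields <- 7
stride  <- nfields + 1   # seven fields plus the blank separator line

lines <- c(
  "SRC:Guettarda", "TGT:Lord Roem", "VOT:1", "RES:1", "YEA:2013",
  "DAT:19:53, 25 January 2013", "TXT:'''Support''' per [[WP:DEAL]]", "",
  "SRC:ExampleUser", "TGT:Lord Roem", "VOT:-1", "RES:1", "YEA:2013",
  "DAT:20:10, 25 January 2013", "TXT:'''Oppose''' for now.", ""
)

# The fixed-stride approach only works if the line count is a multiple of 8.
stopifnot(length(lines) %% stride == 0)

# Drop the 4-character "KEY:" prefix; blank lines stay as empty strings.
values <- substring(lines, 5)

# Fill row by row so that each record becomes one row of the matrix.
m <- matrix(values, ncol = stride, byrow = TRUE)

# Keep the seven data columns and name them after the keys of the first record.
votes <- as.data.frame(m[, seq_len(nfields), drop = FALSE], stringsAsFactors = FALSE)
names(votes) <- substring(lines[seq_len(nfields)], 1, 3)
```

If the real file does not end with a blank line, the length check fails, and an empty string would have to be appended before reshaping.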

The alternative method mentioned in the comments was too slow:

SRC <- read.csv(pipe("sed -n '1~8p' wiki-RfA.txt"))
TGT <- read.csv(pipe("sed -n '2~8p' wiki-RfA.txt"))
chinsoon12

Here is an elegant solution. I saved your example to an ASCII file, "testdat". One thing you might want to consider first is that your delimiter, the colon, also crops up inside your data (in the DAT and TXT values, for example). This makes handling the data more difficult, and it should be fairly trivial for you to change the delimiter before reading the data in. I changed it to this:

SRC;Guettarda
TGT;Lord Roem
VOT;1
RES;1
YEA;2013
DAT;19:53, 25 January 2013
TXT;'''Support''' per [[WP:DEAL]]: clueful, and unlikely to break Wikipedia.

That is, I replaced the delimiting colons with semicolons.

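Then it's easy (for a single record):

# quote="" and comment.char="" keep the apostrophes in TXT (and any "#") literal
kv <- read.table("testdat", stringsAsFactors=FALSE, sep=";", quote="", comment.char="")

# transpose the value column into a one-row data frame
p <- as.data.frame(t(kv$V2), stringsAsFactors=FALSE)

# use the keys as column names
colnames(p) <- kv$V1

Then `p` is what you want.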
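Because the values are transposed out of a single character column, every column of the resulting data frame is a character vector. A small sketch of converting the numeric fields afterwards, assuming VOT, RES and YEA always hold integers (the data frame is rebuilt inline here so that the example runs on its own):

```r
# One-row data frame as produced by the transpose step, all columns character.
p <- data.frame(
  SRC = "Guettarda", TGT = "Lord Roem", VOT = "1", RES = "1", YEA = "2013",
  DAT = "19:53, 25 January 2013",
  TXT = "'''Support''' per [[WP:DEAL]]: clueful, and unlikely to break Wikipedia.",
  stringsAsFactors = FALSE
)

# Assumption: these three fields are always integer-valued.
num_cols <- c("VOT", "RES", "YEA")
p[num_cols] <- lapply(p[num_cols], as.integer)

str(p)
```

The DAT column is left as text here. Parsing it with `as.POSIXct` would need a format containing `%B`, whose month names depend on the current locale.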

RHA
burke