
I want to write a parser that can read a text file and convert it to tabular form. The text file is in this format:

textfile.txt 
--------------------------------
colA: dataA
colB: dataB
colC: dataC 

ColA: dataA
ColB: dataB
ColC: dataC 

It should generate:

ColA  ColB  ColC
dataA dataB dataC
dataA dataB dataC

If anyone can help me, it would be a big help, as I have searched everywhere but cannot find a solution.

  • you could use `sep = ":"` in `read.table`, see also the help file: `?read.table` – Jaap Mar 06 '16 at 11:35
  • thanks for correcting the file, but please look at the sample txt file again. I tried `mytable <- read.table("SampleMoviesData.txt", sep=":")` and got `Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 17 did not have 2 elements`. It looks like the space between them is generating the error – AlwaysVictory724 Mar 06 '16 at 12:01
  • Please add the code you tried and the error message to the question instead of posting them in the comments. – Jaap Mar 06 '16 at 12:03
  • maybe adding `fill = TRUE` helps, see also the help file: `?read.table` – Jaap Mar 06 '16 at 12:05
  • That doesn't give the required output, and the link is not my question; please see the edits :( – AlwaysVictory724 Mar 06 '16 at 12:20
  • read.csv , read.table – AlwaysVictory724 Mar 06 '16 at 14:53

1 Answer


As far as I can see, you have 3 problems in your data:

  1. Blank lines in your data file.
  2. Some of the values in the first column start with an uppercase letter, some with lower case.
  3. The data is not in the format you would like to see (i.e. wide format).

This can be solved as follows:

1) Read the data using the correct separator and the `blank.lines.skip` parameter (and possibly also `fill = TRUE`):

mydf <- read.table(text="colA: dataA
colB: dataB
colC: dataC

ColA: dataA
ColB: dataB
ColC: dataC", sep=":", header=FALSE, blank.lines.skip=TRUE)

this gives:

> mydf
    V1     V2
1 colA  dataA
2 colB  dataB
3 colC  dataC
4 ColA  dataA
5 ColB  dataB
6 ColC  dataC
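
Note that the values in `V2` keep the space that follows the colon. As a minimal sketch of the same step reading from a file rather than from `text=`, the snippet below adds `strip.white = TRUE` to trim that whitespace. The temporary file stands in for `textfile.txt`, and the name `raw_df` is only illustrative; neither comes from the original answer.

```r
# Write the sample from the question to a temporary file
# (a stand-in for textfile.txt; replace `path` with your own file name).
path <- tempfile(fileext = ".txt")
writeLines(c("colA: dataA", "colB: dataB", "colC: dataC", "",
             "ColA: dataA", "ColB: dataB", "ColC: dataC"), path)

# sep = ":" splits each line into key and value,
# blank.lines.skip = TRUE drops the empty separator line,
# strip.white = TRUE trims the whitespace around each field.
raw_df <- read.table(path, sep = ":", header = FALSE,
                     blank.lines.skip = TRUE, strip.white = TRUE,
                     stringsAsFactors = FALSE)
print(raw_df)
```

If some lines in a real file have no value after the colon, `fill = TRUE` (suggested in the comments) pads the short rows instead of raising an error.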

2) Capitalize the values in the first column:

mydf$V1 <- gsub('(^[a-z])','\\U\\1', mydf$V1, perl=TRUE)
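
To see what this substitution does in isolation, here is a small check on a hypothetical vector of keys (the name `keys` is illustrative only). With `perl=TRUE`, `\\U` upper-cases the captured first letter; keys that already start with an uppercase letter do not match and are left alone.

```r
# Mixed-case keys like those in the question's file
keys <- c("colA", "ColB", "colC")

# Capture a leading lowercase letter and replace it with its uppercase form
fixed <- gsub("(^[a-z])", "\\U\\1", keys, perl = TRUE)
print(fixed)
# [1] "ColA" "ColB" "ColC"
```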

3) Reshape to wide format:

library(data.table)
dcast(setDT(mydf), rowid(V1) ~ V1, value.var = 'V2')[, V1 := NULL][]

which gives:

     ColA   ColB   ColC
1:  dataA  dataB  dataC
2:  dataA  dataB  dataC

The above reshaping solution uses the development version (1.9.7) of data.table.
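
If installing a development version is not an option, the same reshape can be sketched in base R. This is an alternative technique, not the method used above: `ave()` numbers each occurrence of a key (playing the role of `rowid()`), and `reshape()` spreads the keys into columns. The names `long`, `wide`, and `id` are illustrative.

```r
# Long-format data as it looks after steps 1 and 2
long <- data.frame(V1 = rep(c("ColA", "ColB", "ColC"), 2),
                   V2 = rep(c("dataA", "dataB", "dataC"), 2),
                   stringsAsFactors = FALSE)

# Record number: 1 for the first occurrence of each key, 2 for the second, ...
long$id <- ave(seq_along(long$V1), long$V1, FUN = seq_along)

# One row per record, one column per key
wide <- reshape(long, idvar = "id", timevar = "V1", direction = "wide")

# Drop the helper column and strip the "V2." prefix that reshape() adds
wide$id <- NULL
names(wide) <- sub("^V2\\.", "", names(wide))
print(wide)
```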

For more alternatives for reshaping your data, see "Transposing Long to Wide without Timevar".

Jaap
  • `fread()` with `blank.lines.skip=TRUE` reads this file as expected now ([just pushed a fix](https://github.com/Rdatatable/data.table/commit/124bc1dced7f2ade9098079da1590be5da3033b9)). Not sure if you attempted with `fread` first.. – Arun Mar 06 '16 at 18:55
  • @Arun Didn't try with `fread` first, but nice to know that this now works properly :-) (haven't run into bug previously though) – Jaap Mar 06 '16 at 19:45