2

Someone has sent me a .txt file that has no header. Also, there is no structure, so all rows follow the previous one on the same line of the file.

The only thing I know is that every 16 items (separated by “,”) there should be a single row in the final output because there are 16 variables or columns for each plot. Each line of the raw file contains all the measurements for the 16 variables for 438 different plots for one day. In total, the raw file contains 4015 lines (days), with 438x16 items in each line I assume (there might be NAs).

I have managed to read the file as:

x <- readLines("Data/meteodata.dat")
x <- as.data.frame(matrix(as.numeric(unlist(strsplit(x, ","))), ncol = 16, byrow = TRUE))

However, I now would need to aggregate the variables grouping by plot, so instead of such a huge dataset I end up with only 438 rows and 16 columns containing the mean values for each variable and plot. The problem is that the columns do not contain an identifier for each plot to group by. The key is that this messy dataset was generated by someone else from a dataset (y) with 438 rows (one per plot) that does contain plot labels, and in the same order:

> nrow(y)
[1] 438
> head(y)
  CODE_PLOT CODE_COUNTRY
1         1            1
2         1           12
3         1           14
4         1           15
5         1            5
6         1           50

Hence, EACH LINE OF THE MESSY FILE CORRESPONDS WITH THE CODE_COUNTRY AND CODE_PLOT OF “Y” IN THE SAME ORDER.

Thanks and sorry for such an abstract message.

Example of the file before being able to read it:

48.25,4.25,1.989e+07,2.6,5.89,1.28,0.02,0,0,0.42,3575,0,-0.4,2.6,2.57,6.48,50,6,1.989e+07,3.55,5.42,2.31,0.42,0,0,0.15,2420,0,0.27,3.55,2,7.8

Example of the dataset after being able to read it:

> head(test)
    lat  long     date temp.mean temp.max temp.min precip   E0  ES0  ET0 radiation snow.depth
1 48.25  4.25 19890000      2.60     5.89     1.28   0.02 0.00 0.00 0.42      3575       0.00
2    50     6 19890000      3.55     5.42     2.31   0.42 0.00 0.00 0.15      2420       0.00
3 47.75 16.25 19890000      0.67     3.98    -0.92   0.63 0.08 0.00 0.53      5061       0.02
4  69.5    29 19890000    -13.63   -10.06   -20.20   0.10 0.00 0.00 0.02        70      16.56
5 41.75  13.5 19890000      2.05     8.79    -1.72   0.00 0.20 0.06 0.54      8206       0.10
6    47  8.75 19890000     -4.29     2.62    -7.97   0.00 0.00 0.00 0.21      7403       5.45
  water.balance temp.mean2 wind P_hPa
1         -0.40       2.60 2.57  6.48
2          0.27       3.55 2.00  7.80
3          0.10       0.67 3.63  5.17
4          0.08     -13.63 3.65  1.78
5         -0.54       2.05 1.58  6.18
6         -0.21      -4.29 1.22  2.87
fede_luppi
  • 1,063
  • 4
  • 17
  • 29
  • You'll have to manually identify what row of this data belongs to which plot. Then you can `aggregate` and use `cbind` or `merge` to combine with the other data. – Thomas Mar 12 '15 at 06:24
  • I have 438 plots, the whole idea of this is automating the process. I think the key is that each line of the file corresponds to 438 plots in the same order I have in the other dataset I want to link with this and contains the labels of the plots. Perhaps if I could make a list for each line of the file and cbind that line with the clean dataset... – fede_luppi Mar 12 '15 at 08:21

2 Answers2

1
  #messydata.txt : created by copying/pasting the line above into a textfile. 

   #Load Table into R
    data1 <- read.table("messydata.txt", header=FALSE,sep=",", nrows=2, col.names=paste0("C", 1:16) )
   #In col.names you can create the column names you want

       C1   C2       C3   C4   C5   C6   C7   C8 C9  C10  C11 C12   C13  C14  C15  C16
    1 48.25 4.25 19890000 2.60 5.89 1.28 0.02  0  0 0.42 3575   0 -0.40 2.60 2.57 6.48
    2 50.00 6.00 19890000 3.55 5.42 2.31 0.42  0  0 0.15 2420   0  0.27 3.55 2.00 7.80

   #Option 1- Bind your two tables
    cbind(data1, icp)
   #option 2- Join tables if you have a key Variable "ID" 
    require(plyr)
    newdata<- join(x=data1, y=icp, by = "ID")
#The ID can have a different name in x and y. 
Kvasir EnDevenir
  • 907
  • 1
  • 10
  • 25
  • Look for loading .dat files: http://stackoverflow.com/questions/11664075/import-dat-file-into-r – Kvasir EnDevenir Mar 11 '15 at 16:38
  • I don’t know how many rows the file has to include in read.table – fede_luppi Mar 11 '15 at 19:33
  • Because actually it has much more rows that the dataset I want to merge with. Hence, I should read everything, figure out how to include CODE_COUNTRY and CODE_PLOT and then aggregate by CODE_COUNTRY and CODE_PLOT to calculate mean values and use cbind to paste it to the other dataset – fede_luppi Mar 11 '15 at 19:35
0

Try just reading the data in as a single vector (perhaps use readLines or scan) and then convert it to a data.frame via a matrix:

# read in data
x <- "48.25,4.25,1.989e+07,2.6,5.89,1.28,0.02,0,0,0.42,3575,0,-0.4,2.6,2.57,6.48,50,6,1.989e+07,3.55,5.42,2.31,0.42,0,0,0.15,2420,0,0.27,3.55,2,7.8"

# convert to data.frame by organizing as a 2x16 matrix
as.data.frame(matrix(strsplit(x, ",")[[1]], ncol = 16, byrow = TRUE))
##      V1   V2        V3   V4   V5   V6   V7 V8 V9  V10  V11 V12  V13  V14  V15  V16
## 1 48.25 4.25 1.989e+07  2.6 5.89 1.28 0.02  0  0 0.42 3575   0 -0.4  2.6 2.57 6.48
## 2    50    6 1.989e+07 3.55 5.42 2.31 0.42  0  0 0.15 2420   0 0.27 3.55    2  7.8

You will probably want to convert the data to numeric before coercing into a data.frame otherwise you'll get factor variables out of it:

as.data.frame(matrix(as.numeric(strsplit(x, ",")[[1]]), ncol = 16, byrow = TRUE))
##      V1   V2       V3   V4   V5   V6   V7 V8 V9  V10  V11 V12   V13  V14  V15  V16
## 1 48.25 4.25 19890000 2.60 5.89 1.28 0.02  0  0 0.42 3575   0 -0.40 2.60 2.57 6.48
## 2 50.00 6.00 19890000 3.55 5.42 2.31 0.42  0  0 0.15 2420   0  0.27 3.55 2.00 7.80

And if the data match perfectly row by row with your icp data.frame, you can just cbind them together.

Thomas
  • 43,637
  • 12
  • 109
  • 140
  • `> x <- as.data.frame(matrix(strsplit(x, ",")[[1]], ncol = 16, byrow = TRUE)) Error in strsplit(x, ",") : non-character argument` – fede_luppi Mar 11 '15 at 15:27
  • I should have specified on my previous comment that I rad the file like this: `x <- read.table("Data/meteodata.dat")` – fede_luppi Mar 11 '15 at 15:29
  • @fede_luppi You seemed to suggest the file is just a single line of text. If so, read it in as: `x <- readLines("file.txt")` where "file.txt" is the filename. – Thomas Mar 11 '15 at 15:58
  • Ouch! I just realised that it seems that readLines only read the first line of the file! It seems that each row of the file may be a single day, with perhaps as many rows as days in a year! Hence, I would have to read all the lines of the file, figure out hot to include CODE_COUNTRY and CODE_PLOT and aggregate by CODE_COUNTRY and CODE_PLOT to calculate the annual mean of each column – fede_luppi Mar 11 '15 at 19:31
  • When I followed your code I ended up with 438 rows, same as the dataset I want to link. However, the messy data file contains 4015 huge lines! I think the resulting initial 438 rows were just the first line of the file. I should read everything and then aggregate to end up with 438 rows of aggregated data. – fede_luppi Mar 11 '15 at 20:10
  • After running `x <- readLines("file.txt”)` x= Large character (4015 elements, 141Mb). However, after running `as.data.frame(matrix(as.numeric(strsplit(x, ",")[[1]]), ncol = 16, byrow = TRUE))` it seems that only considers the first line – fede_luppi Mar 11 '15 at 20:25
  • How many lines in the file? The part of the code that is `[[1]]` means its only taking the first line. You can remove that and instead use something like: `as.data.frame(matrix(as.numeric(unlist(strsplit(x, ","))), ncol = 16, byrow = TRUE))` – Thomas Mar 11 '15 at 20:35
  • I cannot figure out how to include CODE_COUNTRY and CODE_PLOT to summarise the dataset grouping by them – fede_luppi Mar 11 '15 at 20:47
  • @fede_luppi Your question is becoming much more complex. Please edit your question to include these additional details and requests. – Thomas Mar 11 '15 at 21:58
  • Edits made in the original question – fede_luppi Mar 11 '15 at 22:42