
I have to collect data in R that has been given to me in .xls format, but when I open it with Excel it says that the extension and the format don't match, and Excel suggests I should save it as a .txt file.

The files I have to use typically contain three sections, each holding a different table, and the tables have different sizes and column names. Each section is announced by a title between square brackets. This is a simplified version of my file.

I am only interested in the third section, called '[DATA]'. So far I have manually saved the file as .xlsx and used read_excel to get to the data I need: after reading the whole sheet into R, I find the row containing the title '[DATA]' (its position can vary from file to file, so I can't just pick a fixed row number as I would with readLines), then take the row underneath it (T, Time, Tension, etc.) as the column names of a new data frame built from the rows below. I'd like to be able to do something similar starting from a .txt file, because I have a lot of files to work with and they are all formatted exactly the same way.
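Roughly, my current workflow on the .xlsx side looks like this (simplified; I'm assuming, just for the example, that the '[DATA]' title always sits in the first column of the sheet, and "myfile.xlsx" is a placeholder):

library(readxl)

# read the whole sheet as text, without treating any row as a header
sheet <- read_excel("myfile.xlsx", col_names = FALSE, col_types = "text")

# locate the row holding the section title (its position changes between files)
marker <- which(sheet[[1]] == "[DATA]")[1]

# the row below the title holds the column names (T, Time, Tension, ...),
# everything after that is the table I want (columns are still text here)
dat <- sheet[-seq_len(marker + 1), ]
names(dat) <- as.character(unlist(sheet[marker + 1, ]))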

I've tried several functions to read the file as a .txt:

  1. `A = data.table::fread(file, header = F, fill = F, sep = '\t')` saves the first table from SETUP and stops early, with this error message: "Stopped early on line 25. Expected 24 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<Number of Duts: 24>>". If I set fill = TRUE I get the same result as 3.

  2. `A = read.delim(file)` makes one big column out of all the cells, line after line and cell by cell. It becomes difficult to rearrange the data into a table from there.

  3. `A = data.frame(readLines(file))` makes one big column again, but each line of the file becomes one cell of the data frame, and the content of each cell is a string of all the values separated by \t. Example for line 8: experiment1\group1\t0\t7200\t0.001\t"

  4. `A = read.table(file)` gives this error message: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 4 did not have 2 elements

I don't know which of these functions is the best lead for this task.

NB: The numbers displayed in the error messages might differ from what I would get with the example, and I don't even get the error messages with the example file. (When I made it, Excel required me to put an apostrophe in the cell so the minus sign wouldn't be interpreted as a formula, so I did. I then saved the file as both .txt and .xls, and even renamed the .txt file with an .xls extension to reproduce the extension mismatch of my original file. It works in every case.)

Thanks for your help!

Helene78

2 Answers


You said text file but show a spreadsheet, so I'll demonstrate on a multi-table CSV file:

csvtext <- '[SETUP]
ExpName:
GroupName:
,,
Experiment,Group,Voltage
1,1,1
2,2,2
3,3,3
,,
[RESULT]
Group,Dev,V3
1,1,1
3,3,3
4,4,4
,,
[Data]
"mpg","cyl","disp"
21,6,160
21,6,160
22.8,4,108
'

Read it in as text:

# you may use something like
# rawtext <- readLines("path/to/file.csv")
rawtext <- readLines(textConnection(csvtext))
str(rawtext)
#  chr [1:21] "[SETUP]" "ExpName:" "GroupName:" ",," "Experiment,Group,Voltage" "1,1,1" "2,2,2" "3,3,3" ",," "[RESULT]" ...

We can now split the data based on the "empty" lines, then drop these empty lines:

spltext <- split(rawtext, cumsum(!grepl("[^,\\s]", rawtext)))
spltext <- lapply(spltext, function(z) if (grepl("[^,\\s]", z[1])) z else z[-1])
str(spltext)
# List of 5
#  $ 0: chr [1:3] "[SETUP]" "ExpName:" "GroupName:"
#  $ 1: chr [1:4] "Experiment,Group,Voltage" "1,1,1" "2,2,2" "3,3,3"
#  $ 2: chr [1:5] "[RESULT]" "Group,Dev,V3" "1,1,1" "3,3,3" ...
#  $ 3: chr [1:5] "[Data]" "\"mpg\",\"cyl\",\"disp\"" "21,6,160" "21,6,160" ...
#  $ 4: chr(0) 

(Note that the $ 0 indicates that the name is "0" not 0, so we'll need to use string-numbers for indexing later.)
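For example, positional and name-based indexing pick different elements of this list:

# spltext[[3]] is the third element by position (the one named "2"),
# while spltext[["3"]] is the element whose name is "3"
identical(spltext[[3]], spltext[["3"]])
# [1] FALSE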

From here, since you only want the [Data] section, then

read.csv(text = spltext[["3"]][-1])
#    mpg cyl disp
# 1 21.0   6  160
# 2 21.0   6  160
# 3 22.8   4  108
r2evans
  • Thanks for your answer! I tried your example step by step and it seems that R reads "mpg","cyl","disp" as one cell and not as the column names. I get "more columns than column names". This is what I get with no header: `V1 V2 V3 1 "mpg","cyl","disp" NA NA 2 21 6 160 3 21 6 160 4 22.8 4 108` – Helene78 Jul 07 '21 at 07:38
  • I also tried with my example data and I have the same issue (screenshot: i.stack.imgur.com/QAJU3.png). NB: for it to work I used [^,\\s, \t] instead of [^,\\s] because the separators were things like \t or \t\t\t...\t and not commas. – Helene78 Jul 07 '21 at 08:06
  • So it worked if you updated the separator to include `\t`? That's great! Please accept the answer. – r2evans Jul 07 '21 at 12:08
  • FYI, you don't need two commas in your pattern. It's [regex](https://stackoverflow.com/a/22944075/3358272), it's a character-group, meaning any character inside the `[` and `]` will work; so `!grepl("[^,\\s\t]", rawtext)` will work as well. For that matter, you probably won't need the `,`, since it's tab-delimited (not obvious from the question). – r2evans Jul 07 '21 at 13:37

I made it work on all of my files (.txt) by doing this:

rawtext <- readLines(file)

# separate the sections, which have an "empty" line (only tabs and/or commas) between them
spltext <- split(rawtext, cumsum(!grepl("[^,\t]", rawtext)))

# remove the cell coming from that former empty \t line at the top of each section
spltext <- lapply(spltext, function(z) if (grepl("[^,\t]", z[1])) z else z[-1])

# the chunk indexed by "3" is the one that contains the DATA table
data <- read.delim(text = spltext[["3"]][-1], header = TRUE, check.names = FALSE)
# check.names = FALSE leaves the column names untouched
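Since all my files are formatted exactly the same way, I can wrap this in a small function and apply it to every file in a folder (the folder path and file pattern below are just placeholders for my real ones):

read_data_section <- function(file) {
  rawtext <- readLines(file)
  # split on the "empty" lines made only of tabs and/or commas
  spltext <- split(rawtext, cumsum(!grepl("[^,\t]", rawtext)))
  # drop the leading empty line kept at the top of each chunk
  spltext <- lapply(spltext, function(z) if (grepl("[^,\t]", z[1])) z else z[-1])
  # chunk "3" holds the DATA table: drop its title line and parse the rest
  read.delim(text = spltext[["3"]][-1], header = TRUE, check.names = FALSE)
}

files <- list.files("path/to/folder", pattern = "\\.txt$", full.names = TRUE)
all_data <- lapply(files, read_data_section)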

Helene78