Extract embedded tables

Question

I have a series of csv files that I am trying to analyse. The files have this structure.

"","x"
"1","<HTML>"
"2","<TITLE>Radiosonde Data</TITLE>"
"3","<LINK REL=""StyleSheet"" HREF=""/resources/select.css"" TYPE=""text/css"">"
"4","<BODY BGCOLOR=""white"">"
"5","<H2>15420 LRBS Bucuresti Observations at 00Z 01 Jan 1980</H2>"
"6","<PRE>"
"7","-----------------------------------------------------------------------------"
"8","   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV"
"9","    hPa     m      C      C      %    g/kg    deg   knot     K      K      K "
"10","-----------------------------------------------------------------------------"
"11"," 1000.0    -55                                                               "
"12","  981.0     91    1.4    1.4    100   4.34     45     26  276.1  288.0  276.8"
"13","  850.0   1257    3.6    3.6    100   5.86    155     28  289.9  306.9  290.9"
"14","  700.0   2809   -5.1   -5.1    100   3.76    170     44  296.8  308.2  297.5"
"15","  500.0   5350  -24.9  -25.7     93   0.95    180     36  302.6  305.8  302.8"
"16","  400.0   6940  -36.0  -36.9     91   0.41    195     36  308.1  309.6  308.2"
"17","  300.0   8870  -51.5  -52.8     86   0.10    200     46  312.6  313.0  312.7"
"18","  250.0  10050  -60.3  -61.7     83   0.04    205     66  316.3  316.4  316.3"
"19","  238.0  10345  -61.5  -62.9     83   0.03    200     80  319.0  319.1  319.0"
"20","  200.0  11390  -65.7  -67.3     80   0.02    215     60  328.6  328.7  328.6"
"21","  150.0  13170  -60.3  -61.9     81   0.06    225     56  366.0  366.3  366.0"
"22","  100.0  15680  -62.1  -63.8     80   0.07    235     48  407.5  407.9  407.5""
"23","</PRE><H3>Station info</H3><PRE>"
"24","                         Station ID: LRBS"

The aim is to extract the embedded tables (the data between lines 11-22 in the above example) as separate data frames with 11 columns. So far have I have written a script that reads all the text lines (using readLines) and then identifies the start and the end of each table. But in this way the data manipulation is not very easy. I am wondering what will be a more efficient way of extracting the tables.

Thank you very much for all your help.

Please submit a minimal reproducible example. See: [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). — Eric Fail, Jan 21 '16 at 19:13
Is that the exact contents of the file, with all those line numbers and quotes etc, or is that how R reads it in? It looks like its actually an HTML file. Give us the raw file? — Spacedman, Jan 21 '16 at 19:15
might be easier to go back to the original url. Then use the 'rvest' package. see http://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/ — fishtank, Jan 21 '16 at 19:27
This is the exact content of the file, which I believe is an HTML file (or a part of an HTML file) saved a csv file. — user2096665, Jan 21 '16 at 19:28
or since it looks like you are dealing with a fixed width file. look at `read.fwf()` http://stackoverflow.com/questions/14383710/read-fixed-width-text-file — fishtank, Jan 21 '16 at 19:29
@PierreLafortune Yup. Note the `
` tag which means all the spaces are important and line 11 would be ambiguous if it is not. — fishtank, Jan 21 '16 at 19:44
@PierreLafortune Why not? If you strip away the first column which mess up the alignment then parse row 7-22 which contains the table. — fishtank, Jan 21 '16 at 19:53

score 1 · Accepted Answer · answered Jan 21 '16 at 21:25

Given a file called ick.csv in that format:

Read into two columns - ignore all quotes:

txt = read.table("ick.csv",head=FALSE,quote='',sep=',',skip=1, as.is=TRUE)

Now where does the data start? Look for a load of minus signs in the second column:

headlines = grep("------",txt$V2)

There should be two of those. The header labels are on the line after the first one, and the data starts after the second one, and we want all rows up to two before the end. You may need to tweak this if your files have other stuff at the end:

txt = txt[c(headlines[1]+1, (headlines[2]+1):(nrow(txt)-2)),]

Convert the names into a vector:

names = scan(textConnection(gsub('"',"",txt[1,2])),what="")

Now paste the text into one big text string, and get rid of the quotes:

txt = paste0(txt$V2,sep="",collapse="\n")
txt = gsub('"','',txt)

Now treat that as a fixed-width data, and give it names:

d = read.fwf(textConnection(txt),widths=rep(7,11),skip=2)
names(d) = names

Tada!

> d
   PRES  HGHT  TEMP  DWPT RELH MIXR DRCT SKNT  THTA  THTE  THTV
1   981    91   1.4   1.4  100 4.34   45   26 276.1 288.0 276.8
2   850  1257   3.6   3.6  100 5.86  155   28 289.9 306.9 290.9
3   700  2809  -5.1  -5.1  100 3.76  170   44 296.8 308.2 297.5
4   500  5350 -24.9 -25.7   93 0.95  180   36 302.6 305.8 302.8
5   400  6940 -36.0 -36.9   91 0.41  195   36 308.1 309.6 308.2
6   300  8870 -51.5 -52.8   86 0.10  200   46 312.6 313.0 312.7
7   250 10050 -60.3 -61.7   83 0.04  205   66 316.3 316.4 316.3
8   238 10345 -61.5 -62.9   83 0.03  200   80 319.0 319.1 319.0
9   200 11390 -65.7 -67.3   80 0.02  215   60 328.6 328.7 328.6
10  150 13170 -60.3 -61.9   81 0.06  225   56 366.0 366.3 366.0
11  100 15680 -62.1 -63.8   80 0.07  235   48 407.5 407.9 407.5

Extract embedded tables

1 Answers1