I have a text file in this rather horrendous HTML format:

A<b>Metabolism</b>
B
B  <b>Overview</b>
C    01200 Carbon metabolism [PATH:bpe01200]
D      BP3142 pgi; glucose-6-phosphate isomerase    K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D      BP1971 pgi; glucose-6-phosphate isomerase    K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D      BP1519 fba; fructose-1,6-bisphosphate aldolase   K01624 FBA; fructose-bisphosphate aldolase, class II [EC:4.1.2.13]
D      BP0801 tpiA; triosephosphate isomerase   K01803 TPI; triosephosphate isomerase (TIM) [EC:5.3.1.1]
D      BP1000 gap; glyceraldehyde-3-phosphate dehydrogenase K00134 GAPDH; glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]

I would like to parse this file into columns in R, such as:

A,Metabolism
B,
B,Overview
C,01200,Carbon metabolism,Path,bpe01200
D,BP3142,pgi,glucose-6-phosphate isomerase,K01810,GPI,glucose-6-phosphate isomerase,[EC:5.3.1.9]
...
D,BP1000,gap,glyceraldehyde-3-phosphate dehydrogenase,K00134,GAPDH,glyceraldehyde 3-phosphate dehydrogenase,[EC:1.2.1.12]

The problem is that the delimiter changes from one part of the line to the next. It seems to follow this pattern, e.g.:

D      BP1971 pgi; glucose-6-phosphate isomerase    K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
 ^Tab        ^space^Semi colon                  ^tab      ^space^semi colon

I can think of a not-so-smart way to do it: parsing one delimiter at a time. But does anyone have a smarter solution, or know of a tool that can interpret this nicely?

I would really appreciate some help :)

Thanks

  • If you convert all of the delimiters within a line to the same delimiter, e.g. `tab`, `space` and `semicolon` to just `tab` with a global find and replace, then you only have to deal with one delimiter and it should be easy. Note: I have never used R, but this technique is simple and works across many systems. Since I don't use R I am not posting this as an answer, but if you test it, it works, and you want it as an answer, just let me know and I will post it. – Guy Coder Jan 10 '17 at 13:49
  • I would go with regular expressions. – Aurèle Jan 10 '17 at 13:51
  • @GuyCoder Interesting idea. Giving it a go now; it definitely simplifies the problem, but it doesn't fully solve it, as it destroys the three-word strings I have in some fields. – Jonathan Abrahams Jan 10 '17 at 13:54
  • Good feedback. Then change the delimiters so that the values with spaces are enclosed in quotes. Change the starting delimiter `;` to `tab"` and the ending delimiter `tab` to `"tab`. This should enclose the text with spaces in quotes and keep it together. You could also massage the data into [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) format with delimiters of `,` and `"` around all values. – Guy Coder Jan 10 '17 at 14:05
  • @GuyCoder Woo, yeah, good idea. I think that takes me as far as I can go for now. Just the space between "BP...." and "pgi" cannot be cleared up in this way, but now the problem is much simpler! `D,BP3142 pgi," glucose-6-phosphate isomerase",K01810 GPI," glucose-6-phosphate isomerase ",[EC:5.3.1.9]` – Jonathan Abrahams Jan 10 '17 at 14:16
  • If you plan to do this often, then I agree with Apom that using regular expressions is the way to go. Think of these steps as a proof of concept; once you know it works as expected, convert it to a regular expression. I find doing it this way faster, as you don't have to debug the regular expression while also learning whether the data can be transformed. – Guy Coder Jan 10 '17 at 14:30
  • Edit of earlier comment. I actually do this from time to time and run into the same problem. To get around it I use Microsoft Word on a local computer, not over a server, as the files I do this on are rather large. I convert the delimiters as noted (and as you have done), then in Word convert the data to a table using the selected delimiter; this puts all of the data with the problem space in the same column. Word lets me do a find and replace in just that column, where I change `space` to `"tab"`, and then convert back to text. – Guy Coder Jan 10 '17 at 14:31
  • Edit of earlier comment. I take it that R will allow you to convert the text to a table, globally replace in just one column, and then convert the table back into text. I did a quick search and it looks possible, e.g. [Convert text file data into table in R](http://stackoverflow.com/questions/33585344/convert-text-file-data-into-table-in-r) and [Replace a single column values](http://stackoverflow.com/q/5416674/1243762). – Guy Coder Jan 10 '17 at 14:41
  • @JonathanAbrahams You'd be better off asking for the raw data directly from whoever formatted it that way. – Aurèle Jan 11 '17 at 10:06
  • @Apom, I wish it was that easy! This is a powerful biological analysis, but unfortunately it is always presented in this horrible way. Unless you can think of a sneaky way to get this data in an easier form? What I showed you comes from 'downloading as htext': http://www.kegg.jp/kegg-bin/get_htext?bpe00001 – Jonathan Abrahams Jan 11 '17 at 10:16
  • You could write to http://www.kegg.jp/feedback/ – Aurèle Jan 11 '17 at 10:29
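
The delimiter-normalising idea from the comments above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the helper name `normalise_delims` is made up, the real file is assumed to use tabs between the main fields, and identifiers are assumed to be capital letters followed by digits (as in `BP3142` or `K01810`).

```r
# Hypothetical helper: turn one KEGG "D" row into a character vector of fields
# by rewriting every separator to a tab and then splitting on tabs.
normalise_delims <- function(line) {
  # 1. "; " (the separator after a gene or KO symbol) becomes a tab
  out <- gsub(";\\s*", "\t", line, perl = TRUE)
  # 2. the space right after an identifier that starts a field becomes a tab
  #    (assumption: identifiers are capital letters followed by digits)
  out <- gsub("(^|\\t)([A-Z]+[0-9]+) ", "\\1\\2\t", out, perl = TRUE)
  # 3. the space before the trailing bracketed tag, e.g. [EC:...], becomes a tab
  out <- sub(" (\\[[^\\]]*\\])$", "\t\\1", out, perl = TRUE)
  # With a single delimiter left, a plain split is enough
  strsplit(out, "\t", fixed = TRUE)[[1]]
}

row <- "D\tBP1519 fba; fructose-1,6-bisphosphate aldolase\tK01624 FBA; fructose-bisphosphate aldolase, class II [EC:4.1.2.13]"
fields <- normalise_delims(row)
print(fields)
```

Because some names contain commas ("aldolase, class II"), the fields would still need quoting if they are later written out as CSV.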

3 Answers

library(stringr)
library(purrr)
file <- "A<b>Metabolism</b>
B
B  <b>Overview</b>
C\t01200 Carbon metabolism [PATH:bpe01200]
D\tBP3142 pgi; glucose-6-phosphate isomerase\tK01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D\tBP1971 pgi; glucose-6-phosphate isomerase\tK01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D\tBP1519 fba; fructose-1,6-bisphosphate aldolase\tK01624 FBA; fructose-bisphosphate aldolase, class II [EC:4.1.2.13]
D\tBP0801 tpiA; triosephosphate isomerase\tK01803 TPI; triosephosphate isomerase (TIM) [EC:5.3.1.1]
D\tBP1000 gap; glyceraldehyde-3-phosphate dehydrogenase\tK00134 GAPDH; glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]
This line is to check behavior when parsing fails."
cat(file)
data <- readLines(con = textConnection(file))
# Pattern to capture "A<b>Metabolism</b>" for instance
pattern_1 <- "^(\\w+)\\h*<b>\\h*(\\w+)\\h*</b>\\h*$"
# Pattern to capture "B" for instance
pattern_2 <- "^(\\w+)$"
# Pattern to capture "C\t01200 Carbon metabolism [PATH:bpe01200]" for instance
pattern_3 <- "^(\\w+)\\t+(\\w+)\\s+([^\\[\\t;]*)\\h*(\\[[^\\]]*\\])$"
# Pattern to capture "D\tBP3142 pgi; glucose-6-phosphate isomerase\tK01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]" for instance
pattern_4 <- "^(\\w+)\\t+(\\w+)\\s+(\\w+);\\h*([^\\t]*)\\t+(\\w+)\\s+(\\w+);\\h*([^\\[]*)\\h*(\\[[^\\]]*\\])$"
# Some more explanations:
# Parens wrap groups to extract
# "\\w+" matches words
# "\\t+", "\\s+" or ";\\h*" are specific separators of OP's original data
# "([^\\t]*)" matches anything until the next tab separator
# Convoluted patterns such as "(\\[[^\\]]*\\])" extract whatever is inside brackets
patterns <- mget(paste0("pattern_", 1:4))
# A list of the data parsed 4 times, once for each pattern:
patterns %>% 
  map(~ {
    extraction <- str_match(data, .x)
    cbind(match = !is.na(extraction[, 1]), extraction[, - 1])
  })
# This is closer to your desired output: a list of [un]parsed rows:
data %>%
  map(~ {
    # Find the first pattern that matches. 0 if none does
    pattern_index <- detect_index(patterns, grepl, .x, perl = TRUE)
    # If failed to parse, return original row as length 1 character vector. Else return parsed row as character vector
    if (pattern_index == 0L) .x else str_match(.x, patterns[[pattern_index]])[- 1]
  })

Head of output looks like this:

list(c("A", "Metabolism"), "B", c("B", "Overview"), c("C", "01200", 
"Carbon metabolism ", "[PATH:bpe01200]"), c("D", "BP3142", "pgi", 
"glucose-6-phosphate isomerase", "K01810", "GPI", "glucose-6-phosphate isomerase ", 
"[EC:5.3.1.9]"))
Aurèle
  • I will add comments. This example should work. At least it does on my machine. – Aurèle Jan 10 '17 at 15:39
  • Woah! totally astounded by your level of commitment! Will check it out! – Jonathan Abrahams Jan 10 '17 at 16:25
  • Do you have a website that I can use to test regex on, specific to R? I can't find one that is specific to R. – Jonathan Abrahams Jan 10 '17 at 16:42
  • Regex come in different so-called "flavors". Base R uses `Extended Regular Expressions` (ERE) by default, but all base R functions such as `grepl()` have a `perl = TRUE` argument to use the more common and more powerful `PCRE` (Perl Compatible RE) flavor. `stringr` functions such as `str_match()` use `PCRE`. – Aurèle Jan 10 '17 at 16:47
  • regexr.com says: RegExr uses your browser's RegExp engine for matching, and its syntax highlighting and documentation reflect the JavaScript RegExp standard... So your mileage may vary. – Aurèle Jan 10 '17 at 16:49
  • My mistake (about `stringr`) : The doc says (`help("stringi-search-regex", package = "stringi")`) they use `ICU` which is similar to PCRE – Aurèle Jan 10 '17 at 16:52
  • I am still working my way through your script; it's such a valuable insight into coding for me. But I have realised that my data contains rows such as this one, which lack a value in the third field compared to most rows: `D\tBP1729 dihydrolipoamide dehydrogenase K00382 DLD; dihydrolipoamide dehydrogenase [EC:1.8.1.4]` I realise now that I can just add another regex line. Will give it a go! – Jonathan Abrahams Jan 10 '17 at 17:17
  • I gave you an upvote. Would be nice to see an example run. :) – Guy Coder Jan 10 '17 at 20:08
  • Sure :) I added the head of the final output to my answer. One should be able to run my code on one's machine as a copy-paste; if it doesn't work, let me know. – Aurèle Jan 10 '17 at 20:29
  • Thanks for the example. I asked because I would have to install R to see what it looked like, and seeing the example cleared up what I was expecting from the code. – Guy Coder Jan 10 '17 at 22:09
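
For the rows mentioned in the comments that lack a gene symbol, one possible extra pattern is sketched below. It is not part of the answer above: the name `pattern_5` is hypothetical, the row is assumed to have a tab before the KO identifier as in the other "D" rows, and base R's `regexec()` with `perl = TRUE` is used instead of `stringr` so that the snippet needs no packages.

```r
# Hypothetical fifth pattern: like pattern_4, but with no "symbol;" part
# after the locus tag. "[^;\\t]*" refuses to cross a semicolon or a tab, so
# ordinary rows such as "BP3142 pgi; ..." do not match this pattern.
pattern_5 <- "^(\\w+)\\t+(\\w+)\\s+([^;\\t]*)\\t+(\\w+)\\s+(\\w+);\\h*([^\\[]*)\\h*(\\[[^\\]]*\\])$"

# Assumed row layout (tab before K00382)
row <- "D\tBP1729 dihydrolipoamide dehydrogenase\tK00382 DLD; dihydrolipoamide dehydrogenase [EC:1.8.1.4]"

# regexec() returns the full match plus the capture groups; drop the full match
parsed <- regmatches(row, regexec(pattern_5, row, perl = TRUE))[[1]][-1]

# Insert NA where the gene symbol would be, so the row lines up with the
# 8 fields produced by pattern_4, and strip stray trailing spaces
aligned <- trimws(append(parsed, NA_character_, after = 2))
print(aligned)
```

Padding with `NA` is just one design choice; it keeps every parsed "D" row the same length, which makes it easier to bind the rows into a table later.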
text <- "
A<b>Metabolism</b>
B
B  <b>Overview</b>
C    01200 Carbon metabolism [PATH:bpe01200]
D      BP3142 pgi; glucose-6-phosphate isomerase    K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D      BP1971 pgi; glucose-6-phosphate isomerase    K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D      BP1519 fba; fructose-1,6-bisphosphate aldolase   K01624 FBA; fructose-bisphosphate aldolase, class II [EC:4.1.2.13]
D      BP0801 tpiA; triosephosphate isomerase   K01803 TPI; triosephosphate isomerase (TIM) [EC:5.3.1.1]
D      BP1000 gap; glyceraldehyde-3-phosphate dehydrogenase K00134 GAPDH; glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]
"
library(stringr)

# get the header items (beginning with C blank) 
headers <- str_match(text, "C\\s+(.+)\n")[,2]
header_items <- trimws(str_match(headers, "(\\d+)\\s+([^\\[]+)(.+)")[2:4]) 

# get the detail items (lines beginning with D blank)
details <- str_match_all(text, "D\\s+(.+)\n")[[1]][,2]

# parse each item within detail 

# split on ";" and organize into dataframe
items <- as.data.frame(t(data.frame(
  str_split(details,";\\s")
)), row.names = 1:length(details), stringsAsFactors = FALSE)

# parse each part using pattern matches

# capture () beginning of string ^ and all characters not whitespace [^\\s]+
items$V1A <- str_match(items$V1,"(^[^\\s]+)")[,2]

# capture () end of string $ and a non-whitespace sequence [^\\s]+
items$V1B <- str_match(items$V1,"([^\\s]+)$")[,2]

# capture () beginning of string excluding two non-whitespace sequences [^\\s]+ at end $
items$V2A <- str_match(items$V2,"^(.+)\\s[^\\s]+\\s[^\\s]+$")[,2]

# capture () non-whitespace sequence [^\\s]+ at end of string $
items$V2C <- str_match(items$V2,"([^\\s]+)$")[,2]

# capture () second to last non-whitespace sequence [^\\s]+ at end of string $ 
items$V2B <- str_match(items$V2,"([^\\s]+)\\s[^\\s]+$")[,2]

# capture () beginning of string ^ excluding last non-whitespace sequence [^\\s]+
items$V3A <- str_match(items$V3,"^(.+)\\s[^\\s]+$")[,2]

# capture () non-whitespace sequence at end $
items$V3B <- str_match(items$V3,"([^\\s]+)$")[,2]

# select & reorder
items <- items[, c("V1A", "V1B", "V2A", "V2B", "V2C", "V3A", "V3B")]

items

#     V1A  V1B                                      V2A    V2B   V2C                                      V3A           V3B
#1 BP3142  pgi         glucose-6-phosphate isomerase    K01810   GPI            glucose-6-phosphate isomerase  [EC:5.3.1.9]
#2 BP1971  pgi         glucose-6-phosphate isomerase    K01810   GPI            glucose-6-phosphate isomerase  [EC:5.3.1.9]
#3 BP1519  fba     fructose-1,6-bisphosphate aldolase   K01624   FBA fructose-bisphosphate aldolase, class II [EC:4.1.2.13]
#4 BP0801 tpiA              triosephosphate isomerase   K01803   TPI          triosephosphate isomerase (TIM)  [EC:5.3.1.1]
#5 BP1000  gap glyceraldehyde-3-phosphate dehydrogenase K00134 GAPDH glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]
Andrew Lavers

And here is a simpler version that extracts only the detail rows, combining the same regex fragments into a single match:

text <- "
A<b>Metabolism</b>
B
B  <b>Overview</b>
C    01200 Carbon metabolism [PATH:bpe01200]
D      BP3142 pgi; glucose-6-phosphate isomerase    K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D      BP1971 pgi; glucose-6-phosphate isomerase    K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D      BP1519 fba; fructose-1,6-bisphosphate aldolase   K01624 FBA; fructose-bisphosphate aldolase, class II [EC:4.1.2.13]
D      BP0801 tpiA; triosephosphate isomerase   K01803 TPI; triosephosphate isomerase (TIM) [EC:5.3.1.1]
D      BP1000 gap; glyceraldehyde-3-phosphate dehydrogenase K00134 GAPDH; glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]
"

library(stringr)

# get the detail items (lines beginning with D blank)
details <- str_match_all(text, "D\\s+(.+)\n")[[1]][,2]

details
pattern <- "([^\\s]+)\\s([^\\s]+);(.*)\\s([^\\s]+)\\s([^\\s]+);\\s(.*)\\s([^\\s]+)$"
trimws(str_match(details, pattern)[,-1])

#[,1]     [,2]   [,3]                                       [,4]     [,5]   
#[1,] "BP3142" "pgi"  "glucose-6-phosphate isomerase"            "K01810" "GPI"  
#[2,] "BP1971" "pgi"  "glucose-6-phosphate isomerase"            "K01810" "GPI"  
#[3,] "BP1519" "fba"  "fructose-1,6-bisphosphate aldolase"       "K01624" "FBA"  
#[4,] "BP0801" "tpiA" "triosephosphate isomerase"                "K01803" "TPI"  
#[5,] "BP1000" "gap"  "glyceraldehyde-3-phosphate dehydrogenase" "K00134" "GAPDH"
#               [,6]                                       [,7]           
#[1,] "glucose-6-phosphate isomerase"            "[EC:5.3.1.9]" 
#[2,] "glucose-6-phosphate isomerase"            "[EC:5.3.1.9]" 
#[3,] "fructose-bisphosphate aldolase, class II" "[EC:4.1.2.13]"
#[4,] "triosephosphate isomerase (TIM)"          "[EC:5.3.1.1]" 
#[5,] "glyceraldehyde 3-phosphate dehydrogenase" "[EC:1.2.1.12]"
Andrew Lavers
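
To get from a matrix like the one above to the comma-separated columns asked for in the question, something along these lines might work. It is a sketch only: the column names are invented for illustration, the sample rows are assumed to use a tab before the KO identifier, and base R's `regexec()` stands in for `str_match()` so that no package is required.

```r
# Two sample detail rows with the leading "D" and whitespace already removed
details <- c(
  "BP3142 pgi; glucose-6-phosphate isomerase\tK01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]",
  "BP1519 fba; fructose-1,6-bisphosphate aldolase\tK01624 FBA; fructose-bisphosphate aldolase, class II [EC:4.1.2.13]"
)

# Same single-match pattern as in the answer above
pattern <- "([^\\s]+)\\s([^\\s]+);(.*)\\s([^\\s]+)\\s([^\\s]+);\\s(.*)\\s([^\\s]+)$"

# One row per input line; drop column 1 (the full match) and trim whitespace
m <- regmatches(details, regexec(pattern, details, perl = TRUE))
parsed <- trimws(do.call(rbind, m)[, -1, drop = FALSE])

# Hypothetical column names, chosen only for readability
df <- as.data.frame(parsed, stringsAsFactors = FALSE)
names(df) <- c("locus_tag", "gene", "product", "ko_id", "ko_symbol", "ko_name", "ec")

# write.csv() quotes character fields by default, which protects names that
# contain commas; capture.output() collects the text instead of writing a file
csv_lines <- capture.output(write.csv(df, row.names = FALSE))
print(csv_lines)
```

Passing a file name to `write.csv()` instead of capturing the output would write the same lines to disk.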