0

I have couple hundred of xml files with very simple structure:

NM_000016   3.87178e-09 3:7.8009e-248   14:1.1621e-262

NM_000018   5.49001e-12 9:1.06231e-154  

NM_001032295    2.5366e-10  14:4.46519e-286 30:1.70136e-17  31:1.26709e-268

I want to convert them into csv files which is slightly different from the original one:

NM_000016   3.87178e-09 3  7.8009e-248  14  1.1621e-262

NM_000018   5.49001e-12 9  1.06231e-154 

NM_001032295    2.5366e-10  14  4.46519e-286    30  1.70136e-17 31  1.26709e-268    

The differences is: the ":" within "3:7.8009e-248"is deleted and the numbers"3" "7.8009e-248" are assigned to two columns.

How can I do it efficiently in R? Many Thanks!

AwokeKnowing
  • 7,728
  • 9
  • 36
  • 47
Paul.j
  • 794
  • 2
  • 8
  • 17
  • 1
    Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist. – villecoder Oct 30 '13 at 16:15
  • `XML::xmlToDataFrame` and `utils::write.csv` – Jake Burkhead Oct 30 '13 at 16:20
  • 1
    Your example XML doesn't look like XML. XML usually has lots of signs. – Spacedman Oct 30 '13 at 18:21

2 Answers2

2

as jake said using XML::xmlToDataFrame and utils::write.csv will help if your xml files are just right. If not, you might have to resort to regular expressions to find the text within the tags, build the data frame, then use write.csv

AwokeKnowing
  • 7,728
  • 9
  • 36
  • 47
  • dont't resort to regular expressions. resort to parsing the XML with all the other functions in `package:XML`. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Jake Burkhead Oct 30 '13 at 16:23
  • yeah, but it seems his "xml" files have columns a data without tags. – AwokeKnowing Oct 30 '13 at 16:26
  • right. by looking at the content of these xml files, they look like txt file without anything like – Paul.j Oct 30 '13 at 17:27
0

Without actually seeing what you're working with, here's one suggestion:

First, some sample data:

X <- tempfile()
cat("NM_000016   3.87178e-09 3:7.8009e-248   14:1.1621e-262",
    "NM_000018   5.49001e-12 9:1.06231e-154",
    "NM_001032295    2.5366e-10  14:4.46519e-286 30:1.70136e-17  31:1.26709e-268",
    sep = "\n", file = X)

Use read.table with fill = TRUE (since your data appear to be unbalanced).

Y <- read.table(X, header = FALSE, fill = TRUE)
Y
#             V1          V2              V3             V4              V5
# 1    NM_000016 3.87178e-09   3:7.8009e-248 14:1.1621e-262                
# 2    NM_000018 5.49001e-12  9:1.06231e-154                               
# 3 NM_001032295 2.53660e-10 14:4.46519e-286 30:1.70136e-17 31:1.26709e-268

Use my "splitstackshape" package to conveniently split the columns into multiple columns.

library(splitstackshape)
Z <- concat.split.multiple(Y, c("V3", "V4", "V5"), ":")
Z
#             V1          V2 V3_1         V3_2 V4_1         V4_2 V5_1         V5_2
# 1    NM_000016 3.87178e-09    3 7.80090e-248   14 1.16210e-262   NA           NA
# 2    NM_000018 5.49001e-12    9 1.06231e-154   NA           NA   NA           NA
# 3 NM_001032295 2.53660e-10   14 4.46519e-286   30  1.70136e-17   31 1.26709e-268

Of course, from here, it's just write.csv to write the new version to a CSV file.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • in my case: X <-myfile.xml, read.table() works well with such X. Thanks a lot! You folks have been a great help! – Paul.j Oct 31 '13 at 14:32