This question is related to the following question:

How to parse tab-delimited data (of different formats) into a data.table/data.frame?

I have a malformed text file in the following tab-delimited format:

A   1092    -   1093    +   1X
B   1093    HRDCPMRFYT
A   1093    +   1094    -   1X
B   1094    BSZSDFJRVF
A   1094    +   1095    +   1X
B   1095    SSTFCLEPVV
...

However, the file also contains a few lines that are technically tab-delimited but consist of one very long string, e.g. the rows 'Z' and 'Y' here:

Z  FX:E:4.2
Y   23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M 
A   1092    -   1093    +   1X
B   1093    HRDCPMRFYT
A   1093    +   1094    -   1X
B   1094    BSZSDFJRVF
A   1094    +   1095    +   1X
B   1095    SSTFCLEPVV
...

There is a section of this text file whereby Y 23434M,23434M,... is possibly several GB long.

These lines are exceptionally rare, and are only identifiable by the leading Z or Y. So far I have opened the file in a text editor and deleted these lines by hand.

However, this is not algorithmically reasonable. Is there a way to parse this file such that either (1) only rows A and B are used or (2) rows Z and Y are explicitly not used?

EDIT: To clarify, Z is not a long string. Only 'Y' is a long string here. Z is a string of the format X XX:X:0.0, whereby X is a character and 0 an integer.

ShanZhengYang
  • As a very quick check, could you do `nchar(rd)` (from previous Q), and exclude the rows with > X number of characters? – user20650 May 13 '18 at 12:53
  • `Z` isn't a long string, how are you planning to identify it? – David Arenburg May 13 '18 at 13:09
  • One option is to use `read.delim2` with `comment.char = "Y"` to skip/ignore rows starting with `Y`. But it will not work in your case, as your data has the `Y` character in other fields. – MKR May 13 '18 at 13:12
  • You can delete long lines using `sed` (from previous question), e.g. `fread("sed -e '/^.\\{100\\}./d' -e '$!N;s/\\n/ /' test.tab")` though I still don't understand how `Z` is a long line. It would be nice if you were a bit more responsive. – David Arenburg May 13 '18 at 13:21
  • @DavidArenburg Sorry, offline. You are correct; it is unclear above. Z is a string of the format `X XX:X:0.0`, whereby X is a character and `0` an integer. – ShanZhengYang May 13 '18 at 21:53
  • @MKR I think this *is* a good way to ignore 'Z' though, right? – ShanZhengYang May 13 '18 at 21:56

1 Answer

You can make a system call to fix the file in place with, say, sed, deleting every line that matches a pattern. To remove all the rows that begin with Z or Y, pass a regular expression followed by the d (delete) command:

system("sed -i '/^[ZY]/d' test.tab")

The command above removes all the rows that begin with Z or Y from your file; note that it overwrites the file on disk. Then you can run the same code I posted in your previous question:

library(data.table)
fread("sed '$!N;s/\\n/ /' test.tab")
#    V1   V2 V3   V4 V5   V6   V7         V8
# 1:  A 1092  - 1093  + 1X B 1093 HRDCPMRFYT
# 2:  A 1093  + 1094  - 1X B 1094 BSZSDFJRVF
# 3:  A 1094  + 1095  + 1X B 1095 SSTFCLEPVV
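
If you would rather not modify the file on disk, one non-destructive variant is to let grep stream only the A and B rows into R. The following is a minimal sketch in base R rather than the fread call above. It assumes grep is available on the system path and that A and B rows strictly alternate; the temporary file and the shortened sample data are made up for illustration, and I have not tested it on a multi-GB line.

```r
# Illustrative sample data (shortened; the real Y line may be several GB long)
sample_lines <- c(
  "Z\tFX:E:4.2",
  "Y\t23434M,23434M,23434M",
  "A\t1092\t-\t1093\t+\t1X",
  "B\t1093\tHRDCPMRFYT",
  "A\t1093\t+\t1094\t-\t1X",
  "B\t1094\tBSZSDFJRVF"
)
path <- tempfile(fileext = ".tab")  # hypothetical stand-in for test.tab
writeLines(sample_lines, path)

# grep passes through only rows starting with A or B; the file itself is not changed
kept <- readLines(pipe(paste("grep '^[AB]'", shQuote(path))))

# Assumption: rows strictly alternate A, B, so odd entries are A and even entries are B
a_rows <- kept[c(TRUE, FALSE)]
b_rows <- kept[c(FALSE, TRUE)]

# Join each A row with the B row that follows it, then parse as tab-separated
df <- read.delim(text = paste(a_rows, b_rows, sep = "\t"),
                 header = FALSE, stringsAsFactors = FALSE)
print(df)
```

Because the joined fields are separated by tabs here rather than by a space, the A and B parts land in separate columns, giving nine columns instead of the eight shown above.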

Data

text <- "Z FX:E:4.2
Y  23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M 
A   1092    -   1093    +   1X
B   1093    HRDCPMRFYT
A   1093    +   1094    -   1X
B   1094    BSZSDFJRVF
A   1094    +   1095    +   1X
B   1095    SSTFCLEPVV"

# Saving it as tab separated file on disk
write(gsub(" +", "\t", text), file = "test.tab")
David Arenburg