This question is related to the following question:
How to parse tab-delimited data (of different formats) into a data.table/data.frame?
I have a malformed text file whose tab-delimited format is the following:
A 1092 - 1093 + 1X
B 1093 HRDCPMRFYT
A 1093 + 1094 - 1X
B 1094 BSZSDFJRVF
A 1094 + 1095 + 1X
B 1095 SSTFCLEPVV
...
However, there are several lines in the text file which are technically tab-delimited but consist of long strings, e.g. the rows 'Z' and 'Y' here:
Z FX:E:4.2
Y 23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M
A 1092 - 1093 + 1X
B 1093 HRDCPMRFYT
A 1093 + 1094 - 1X
B 1094 BSZSDFJRVF
A 1094 + 1095 + 1X
B 1095 SSTFCLEPVV
...
There is a section of this text file where the line Y 23434M,23434M,... is possibly several GB long.
These lines are exceptionally rare, and are only labeled by a preceding Z or Y. So far, I have opened the file in a text editor and deleted these lines by hand.
However, this manual step cannot be automated. Is there a way to parse this file such that either (1) only rows A and B are used or (2) rows Z and Y are explicitly not used?
EDIT: To clarify, Z is not a long string. Only 'Y' is a long string here. Z is a string of the format X XX:X:0.0, whereby X is a character and 0 an integer.
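For illustration, option (1) could be sketched in base R roughly as follows. This is only a sketch under assumptions: the helper name read_ab_lines and its chunk_size parameter are hypothetical, the temporary file is a small stand-in for the real data, and fields are assumed to be separated by single tab characters.

```r
# Hypothetical helper: read the file in chunks and keep only the lines
# whose first tab-delimited field is A or B (everything else is dropped).
read_ab_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    kept <- c(kept, chunk[grepl("^[AB]\t", chunk)])
  }
  kept
}

# Small stand-in file (assumption: the real file has the same layout).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Z\tFX:E:4.2",
  "Y\t23434M,23434M,23434M",
  "A\t1092\t-\t1093\t+\t1X",
  "B\t1093\tHRDCPMRFYT",
  "A\t1093\t+\t1094\t-\t1X",
  "B\t1094\tBSZSDFJRVF"
), tmp)

ab <- read_ab_lines(tmp)

# A and B rows have different numbers of fields, so parse them separately.
a_rows <- read.table(text = ab[startsWith(ab, "A\t")], sep = "\t",
                     stringsAsFactors = FALSE)
b_rows <- read.table(text = ab[startsWith(ab, "B\t")], sep = "\t",
                     stringsAsFactors = FALSE)
```

Note that readLines still loads each line in full, so a Y line of several GB may not fit in a single R string. In that case, filtering outside R first (for example with grep, passed to data.table::fread through its cmd argument) may be the more practical route.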