Here's a dplyr solution that uses a left_join(), but otherwise relies exclusively on vectorized operations, which are significantly more efficient than looping over large datasets.
While the code might appear lengthy, this is simply a formatting choice: for the sake of clarity, I use
foo(
arg_1 = bar,
arg_2 = baz,
# ...
arg_n = qux
)
rather than the one-liner foo(bar, baz, qux). Also for the sake of clarity, I will elaborate on the line
# Map each row to its house ID.
House_id = data[row_number()[target][cumsum(target)]],
in the Details section.
Solution
Given a file like subset.txt, reproduced here
H18105265_0
R1_0
Mab_3416311514210525745_W923650.80
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W123650.80
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W923650.80
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W923650.80
Mab_5325411210110485554_W723650.80
T1_0
T2_0
T3_0
T4_0
and a reference dataset like df, reproduced here
df <- tibble::tribble(
~House_id, ~id, ~new_weight,
18105265, "Mab", 4567,
18117631, "Maa", 3367,
18121405, "Mab", 4500,
71811763, "Maa", 2455,
71811763, "Mab", 2872
)
the following solution
# For manipulating data.
library(dplyr)
# ...
# Code to generate your reference 'df'.
# ...
# Specify the filepath.
text_filepath <- "subset.txt"
# Define the textual pattern for each data item we want, where the relevant
# values are divided into their own capture groups.
regex_house_id <- "(H)(\\d+)(_)(\\d)"
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
# Read the textual data (into a dataframe).
data.frame(data = readLines(text_filepath)) %>%
# Transform the textual data.
mutate(
# Target (TRUE) the identifying row (house ID) for each (contiguous) group.
target = grepl(
# Use the textual pattern for house IDs.
pattern = regex_house_id,
x = data
),
# Map each row to its house ID.
House_id = data[row_number()[target][cumsum(target)]],
# Extract the underlying numeric ID from the house ID.
House_id = gsub(
pattern = regex_house_id,
# The numeric ID is in the 2nd capture group.
replacement = "\\2",
x = House_id
),
# Treat the numeric ID as a number.
House_id = as.numeric(House_id),
# Target (TRUE) the weighted labels.
target = grepl(
# Use the textual pattern for weighted labels.
pattern = regex_weighted_label,
x = data
),
# Extract the ID from (only) the weighted labels.
id = if_else(
target,
gsub(
pattern = regex_weighted_label,
# The ID is in the 1st capture group.
replacement = "\\1",
x = data
),
# For any data that is NOT a weighted label, give it a blank (NA) ID.
as.character(NA)
),
# Extract from (only) the weighted labels everything else but the weight.
rest = if_else(
target,
gsub(
pattern = regex_weighted_label,
# Everything is in the 2nd, 3rd, and 4th capture groups; ignoring the ID
# (1st) and the weight (5th).
replacement = "\\2\\3\\4",
x = data
),
# For any data that is NOT a weighted label, make it blank (NA) for
# everything else.
as.character(NA)
)
) %>%
# Link (JOIN) each weighted label to its new weight; with blanks (NAs) for
# nonmatches.
left_join(df, by = c("House_id", "id")) %>%
# Replace (only) the weighted labels, with their updated values.
mutate(
data = if_else(
target,
# Generate the updated value by splicing together the original components
# with the new weight.
paste0(id, rest, new_weight),
# For data that is NOT a weighted label, leave it unchanged.
data
)
) %>%
# Extract the column of updated values.
.$data %>%
# Overwrite the original text with the updated values.
writeLines(con = text_filepath)
will transform your textual data and update the original file.
Result
The original file (here subset.txt) will now contain the updated information:
H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W3367
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W4500
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W2455
Mab_5325411210110485554_W2872
T1_0
T2_0
T3_0
T4_0
Details
The Regex
The text manipulation relies simply on the base functionality of grepl() (to identify matches) and gsub() (to extract components). We divide each of our textual patterns, regex_house_id and regex_weighted_label, into their components, as distinguished by capture groups in regular expressions:
# The "H" prefix. The "_" separator.
# | | | |
regex_house_id <- "(H)(\\d+)(_)(\\d)"
# | | | |
# The digits following "H". The "0" suffix (or any digit).
# The digits after the 'id'.
# The 'id': "M" then 2 small letters. | | The weight (possibly a decimal).
# | | | | | |
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
# | | | |
# The "_" separator. The "_" separator and "W" prefix before weight.
We can use grepl(pattern = regex_weighted_label, x = my_strings) to check which strings in the vector my_strings match the format for a weighted label (like "Mab_3416311514210525745_W923650.80"). We can also use gsub(pattern = regex_weighted_label, replacement = "\\5", x = my_labels) to extract the weights (the 5th capture group) from a vector my_labels of labels in that format.
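For instance, here is a quick, self-contained sketch of those two calls on two lines taken from subset.txt (one weighted label, one non-label line):
# The same pattern as defined in the solution above.
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
# Two lines from subset.txt, purely for illustration.
my_strings <- c("Mab_3416311514210525745_W923650.80", "T1_0")
# Which strings match the weighted-label format?
grepl(pattern = regex_weighted_label, x = my_strings)
# [1]  TRUE FALSE
# Extract the weight (the 5th capture group); note that gsub() returns
# non-matching strings unchanged.
gsub(pattern = regex_weighted_label, replacement = "\\5", x = my_strings)
# [1] "923650.80" "T1_0"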
The Mapping
Found within the first mutate() statement, the line
# Map each row to its house ID.
House_id = data[row_number()[target][cumsum(target)]],
might appear mystifying. However, it is simply a classic arithmetic trick (also employed by @mnist in their solution) to index contiguous values as groups.
The code cumsum(target) scans over the target column, which (at this point in the workflow) has logical values (TRUE FALSE FALSE ...) indicating whether (TRUE) or not (FALSE) the text line is a house ID (like "H18105265_0"). When it hits a TRUE (numerically a 1), it increments its running total, while a FALSE (numerically a 0) leaves the total unchanged.
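For instance, on a short, made-up logical vector (not the real target column), cumsum() behaves like this:
target <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
cumsum(target)
# [1] 1 1 1 2 2 3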
Since the textual data column
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
"H18105265_0" "R1_0" ... "H18117631_0" "R1_0" ... "H18121405_0" ...
has given us the logical target column
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE ...
these values (TRUE and FALSE) are coerced to numeric (1 and 0)
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 ...
to yield the cumsum() here:
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 ...
Notice that we've now mapped each row to its "group number". So much for cumsum(target).
Now for row_number()[target]! In effect, row_number() simply "indexes" each position (row)
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 2 ... 8 9 ... 13 ...
in the data column (or any other column):
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
"H18105265_0" "R1_0" ... "H18117631_0" "R1_0" ... "H18121405_0" ...
So subscripting those indices by target
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
TRUE FALSE ... TRUE FALSE ... TRUE ...
selects only those positions with house IDs:
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 8 13 ...
So if we take that result for row_number()[target]
# House ID: 1st 2nd 3rd ...
# Position:
1 8 13 ...
and subscript it by cumsum(target)
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 ...
we map each row to the position (in data) of its house ID:
# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
1 1 1 1 1 1 1 8 8 8 8 8 8 13 13 ...
This is the outcome of row_number()[target][cumsum(target)].
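In base-R terms, the same composition can be sketched on the short, made-up target vector from before, with seq_along() standing in for row_number() (which is designed to be used inside verbs like mutate()):
target <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
# seq_along() plays the role of row_number() outside of mutate().
positions <- seq_along(target)
positions[target]
# [1] 1 4 6
positions[target][cumsum(target)]
# [1] 1 1 1 4 4 6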
Finally, when we subscript data by these (repeated) positions of its house IDs, we get the House_id column
# |----------------- Group 1 -----------------| |----------------- Group 2 -----------------| |-------------------------- ...
"H18105265_0" "H18105265_0" ... "H18105265_0" "H18117631_0" "H18117631_0" ... "H18117631_0" "H18121405_0" "H18121405_0" ...
where each value in data is mapped to the house ID of its group.
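Putting the pieces together, here is a minimal sketch of the whole trick inside mutate(), on a tiny made-up column (and with a simplified "^H" pattern, used only for this illustration):
library(dplyr)
toy <- tibble(data = c("H1_0", "R1_0", "Mab_1_W9", "H2_0", "R1_0"))
toy %>%
  mutate(
    # TRUE on house-ID rows, FALSE elsewhere (simplified pattern for the sketch).
    target = grepl("^H", data),
    # Repeat each house ID down its (contiguous) group.
    House_id = data[row_number()[target][cumsum(target)]]
  )
# The resulting House_id column is:
# "H1_0" "H1_0" "H1_0" "H2_0" "H2_0"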
Thanks to this House_id column
House_id = data[row_number()[target][cumsum(target)]]
next to our data column, we can map (left_join()) the ids in df to their corresponding textual data.