2

I have a large .txt file with a specific format structure. My goal is to load the text in R using readLines() and replace the weight value of each record with a new value based on my df data frame. I don't want to change the .txt data structure format or parse the .txt file. The final output should have exactly the same structure as the original .txt (written with writeLines()). How can I read it and update the values? Thanks

Here is my reference data frame

df <- tibble::tribble(
        ~House_id,  ~id, ~new_weight,
  18105265, "Mab",        4567,
  18117631, "Maa",        3367,
  18121405, "Mab",        4500,
  71811763, "Maa",        2455,
  71811763, "Mab",        2872
  ) 

Here is a small subset of my .txt

H18105265_0
R1_0
Mab_3416311514210525745_W923650.80
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W123650.80
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W923650.80
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W923650.80
Mab_5325411210110485554_W723650.80
T1_0
T2_0
T3_0
T4_0

Here is the desired output for the first individual record (House_id = 18105265): update the Mab_3416311514210525745_W923650.80 line with the new value Mab_3416311514210525745_W4567, based on df.

H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
DanG

5 Answers

4

EDIT: added id to the lookup to distinguish between non-unique House_id values.

Here's an approach where I read the data, join it to the updated weights in df, and then create an updated value for the rows that start with "M" using that new weight.

library(tidyverse)
read_fwf("txt_sample.txt" ,  col_positions = fwf_empty("txt_sample.txt")) %>% # edit suggested by DanG

# if the row starts with H, extract 8 digit house number and
# use that to join to the table with new weights
mutate(House_id = if_else(str_starts(X1, "H"), as.numeric(str_sub(X1, 2,9)), NA_real_),
       id = if_else(str_starts(X1, "M"), str_sub(X1, 1,3), NA_character_)) %>%
fill(House_id) %>%
left_join(df, by = c("House_id", "id")) %>%
fill(new_weight) %>%

# make new string using updated weight (or keep existing string)
mutate(X1_new = coalesce(
  if_else(str_starts(X1, "M"),
          paste0(word(X1, end = 2, sep = "_"), "_W", new_weight),
          NA_character_),
  X1)) %>%

pull(X1_new) %>% 
writeLines()

Output

H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W3367
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W4500
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W2455
Mab_5325411210110485554_W2872
T1_0
T2_0
T3_0
T4_0
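
If you want the result written back to disk rather than printed to the console, writeLines() also accepts a file path (or connection) as its second argument, so in the pipeline above you can simply pass the path to the final writeLines() call. A minimal sketch, with a placeholder output path:

updated <- c("H18105265_0", "R1_0", "Mab_3416311514210525745_W4567")
writeLines(updated, con = "txt_sample_updated.txt")  # output path is a placeholder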
Jon Spring
2

You can try the following base R code

writeLines(
  do.call(
    paste,
    c(
      lapply(
        unlist(
          strsplit(
            readChar("test.txt", file.info("test.txt")$size),
            "(?<=\\d)\n(?=H)",
            perl = TRUE
          )
        ),
        function(x) {
          with(
            df,
            Reduce(
              function(x, ps) sub(ps[[1]], ps[[2]], x),
              asplit(rbind(
                unlist(regmatches(x, gregexpr("W.*(?=\n)", x, perl = TRUE))),
                paste0("W", new_weight[sapply(sprintf("H%s.*%s_\\d+_W", House_id, id), grepl, x)])
              ), 2),
              init = x
            )
          )
        }
      ),
      # strsplit() consumed the newlines between blocks, so rejoin with "\n"
      sep = "\n"
    )
  )
)

which gives

H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W3367
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W4500
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W2455
Mab_5325411210110485554_W2872
T1_0
T2_0
T3_0
T4_0

Code breakdown

  • We first use the following code to divide the long string into smaller blocks
      unlist(
        strsplit(
          readChar("test.txt", file.info("test.txt")$size),
          "(?<=\\d)\n(?=H)",
          perl = TRUE
        )
      )
  • For each block, we find the matching House_id + id combinations and replace the weight part (e.g., Wxxxxxx) with the corresponding new_weight value
        with(
          df,
          Reduce(
            function(x, ps) sub(ps[[1]], ps[[2]], x),
            asplit(
              rbind(
              unlist(regmatches(x, gregexpr("W.*(?=\n)", x, perl = TRUE))),
              paste0("W", new_weight[sapply(sprintf("H%s.*%s_\\d+_W", House_id, id), grepl, x)])
            ), 2),
            init = x
          )
        )

Note that the last block has two different matching ids, so we use Reduce to replace the weights iteratively.
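
To see what the Reduce() step does in isolation, here is a stripped-down sketch using the weights from that last block: it walks over a list of (old weight, new weight) pairs and applies sub() once per pair, threading the partially updated block through.

x <- "Maa_5325411210120486554_W923650.80\nMab_5325411210110485554_W723650.80\nT1_0\n"

pairs <- list(
  c("W923650.80", "W2455"),
  c("W723650.80", "W2872")
)

Reduce(
  function(x, ps) sub(ps[[1]], ps[[2]], x),
  pairs,
  init = x
)
#> [1] "Maa_5325411210120486554_W2455\nMab_5325411210110485554_W2872\nT1_0\n"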

ThomasIsCoding
1

You have to loop over the lines you obtain after reading your text document with readLines(). You can parse the House_id from the lines starting with H, using hpatt = 'H[0-9]+_0' as the regular expression, and then use the stringr package to process the lines:

for (i in 1:length(lines)){
  line = lines[[i]]

  #detect if line looks like 'H[number]_0'
  if (stringr::str_detect(line, hpatt)){
    #if it does, extract the 'house_id' from the line
    h_id = stringr::str_extract(line, pattern = 'H[0-9]+') %>% 
      stringr::str_replace('H|_0','')
  }

In the second part, you can replace the original weight with the one obtained from your tibble (which I've dubbed replacetibble here). I'm using the regex mpatt = '^[a-zA-Z]+_[0-9]+_W[0-9\\.]+$', which looks for strings of the form [letters-only name]_[number]_W[number with decimal]:

  if (stringr::str_detect(line, mpatt)){
    # split string to get 'id'
    # split string to get the 'id' from the line
    line_id = stringr::str_split(line, '_')[[1]][[1]]
    # look up the weight for this house/id combination
    wt = replacetibble %>% filter(house_id == h_id & id == line_id) %>% pull(weight)
    # replace number in line, split the original line by the 'W'
    # this will of course break if your id contains a W - please
    # adapt logic according to your naming rules
    replaceline = stringr::str_split(line, 'W')[[1]]
    replaceline[length(replaceline)] = wt
    # put the line back together with a 'W' character
    lines[[i]] = paste0(replaceline, collapse = 'W')
  }
}

stringr (see its cheat sheet) is generally quite powerful for manipulating strings.

I'll leave the loading and saving part up to you.
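
For completeness, a minimal load/save sketch around the loop above could look like the following; the file names are placeholders, and replacetibble is the lookup table described above:

library(dplyr)  # provides %>%, filter() and pull() used inside the loop

hpatt <- 'H[0-9]+_0'
mpatt <- '^[a-zA-Z]+_[0-9]+_W[0-9\\.]+$'

lines <- readLines("test.txt")           # load

# ... run the for loop shown above; it updates `lines` in place ...

writeLines(lines, "test_updated.txt")    # save (or overwrite "test.txt")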

  • it should be [H]_[id]_w[weight]. H refers to house_Id and id refers to individual id. I updated my question – DanG Sep 19 '21 at 14:48
1

I tried to put each step in a new object to better understand what is going on. In case any regex is unclear to you, do not hesitate to ask.

House IDs are not limited to any particular number of digits, individual IDs only need to start with "Ma(any character)_" (a rule that could easily be expanded), and hence one house ID can contain any number of individuals.

library(tidyverse)
df <- tibble::tribble(
  ~House_id,  ~id, ~new_weight,
  18105265, "Mab",        4567,
  18117631, "Maa",        3367,
  18121405, "Mab",        4500,
  71811763, "Maa",        2455,
  71811763, "Mab",        2872
)

# read the data
dat <- readLines("test.txt")

# convert to tibble
dat2 <- tibble::tibble(X = dat)

# keep relevant info, i.e. house IDs and individual IDs
dat3 <- dat2 %>% 
  rowid_to_column() %>% 
  filter(grepl(pattern = "H[0-9]+_0", X) | 
           grepl(pattern = "^Ma._[0-9]+", X))
dat3
#> # A tibble: 9 × 2
#>   rowid X                                 
#>   <int> <chr>                             
#> 1     1 H18105265_0                       
#> 2     3 Mab_3416311514210525745_W923650.80
#> 3     8 H18117631_0                       
#> 4    10 Maa_1240111711220682016_W123650.80
#> 5    13 H18121405_0                       
#> 6    15 Mab_2467211713110643835_W923650.80
#> 7    19 H71811763_0                       
#> 8    21 Maa_5325411210120486554_W923650.80
#> 9    22 Mab_5325411210110485554_W723650.80


# determine which individuals belong to which house
dat4 <- dat3 %>% 
  mutate(house1 = grepl(pattern = "H[0-9]+_0", X)) %>% 
  mutate(house2 = cumsum(house1))
dat4
#> # A tibble: 9 × 4
#>   rowid X                                  house1 house2
#>   <int> <chr>                              <lgl>   <int>
#> 1     1 H18105265_0                        TRUE        1
#> 2     3 Mab_3416311514210525745_W923650.80 FALSE       1
#> 3     8 H18117631_0                        TRUE        2
#> 4    10 Maa_1240111711220682016_W123650.80 FALSE       2
#> 5    13 H18121405_0                        TRUE        3
#> 6    15 Mab_2467211713110643835_W923650.80 FALSE       3
#> 7    19 H71811763_0                        TRUE        4
#> 8    21 Maa_5325411210120486554_W923650.80 FALSE       4
#> 9    22 Mab_5325411210110485554_W723650.80 FALSE       4


dat4b <- dat4 %>% 
  filter(grepl(pattern = "H[0-9]+_0", X)) %>% 
  select(house_id = X, house2)
dat4b
#> # A tibble: 4 × 2
#>   house_id    house2
#>   <chr>        <int>
#> 1 H18105265_0      1
#> 2 H18117631_0      2
#> 3 H18121405_0      3
#> 4 H71811763_0      4


# combine house and individual ids next to each other
dat5 <- dat4 %>% 
  left_join(dat4b,
            by = "house2") %>% 
  mutate(prefix = gsub(pattern = "_.+", replacement = "", x = X),
         house_id = as.numeric(gsub("^H|_0", "", house_id))) %>% 
  select(rowid, house_id, prefix, X) %>% 
  filter(grepl(pattern = "^Ma._[0-9]+", X)) 
dat5
#> # A tibble: 5 × 4
#>   rowid house_id prefix X                                 
#>   <int>    <dbl> <chr>  <chr>                             
#> 1     3 18105265 Mab    Mab_3416311514210525745_W923650.80
#> 2    10 18117631 Maa    Maa_1240111711220682016_W123650.80
#> 3    15 18121405 Mab    Mab_2467211713110643835_W923650.80
#> 4    21 71811763 Maa    Maa_5325411210120486554_W923650.80
#> 5    22 71811763 Mab    Mab_5325411210110485554_W723650.80


# add the new weight information for each individual id
dat6 <- left_join(dat5, df,
                  by = c("house_id" = "House_id",
                         "prefix" = "id"))
dat6
#> # A tibble: 5 × 5
#>   rowid house_id prefix X                                  new_weight
#>   <int>    <dbl> <chr>  <chr>                                   <dbl>
#> 1     3 18105265 Mab    Mab_3416311514210525745_W923650.80       4567
#> 2    10 18117631 Maa    Maa_1240111711220682016_W123650.80       3367
#> 3    15 18121405 Mab    Mab_2467211713110643835_W923650.80       4500
#> 4    21 71811763 Maa    Maa_5325411210120486554_W923650.80       2455
#> 5    22 71811763 Mab    Mab_5325411210110485554_W723650.80       2872


# generate the new ids
dat7 <- dat6 %>% 
  mutate(Y = gsub(pattern = "(?=W).+", replacement = "", x = X, perl = T),
         X_new = paste0(Y, "W", new_weight)) %>% 
  select(rowid, X_new)
dat7
#> # A tibble: 5 × 2
#>   rowid X_new                        
#>   <int> <chr>                        
#> 1     3 Mab_3416311514210525745_W4567
#> 2    10 Maa_1240111711220682016_W3367
#> 3    15 Mab_2467211713110643835_W4500
#> 4    21 Maa_5325411210120486554_W2455
#> 5    22 Mab_5325411210110485554_W2872


# replace the old ids by the new ones
dat[dat7$rowid] <- dat7$X_new
dat
#>  [1] "H18105265_0"                                                                           
#>  [2] "R1_0"                                                                                  
#>  [3] "Mab_3416311514210525745_W4567"                                                         
#>  [4] "T1_0"                                                                                  
#>  [5] "T2_0"                                                                                  
#>  [6] "T3_0"                                                                                  
#>  [7] "V64_0_2_010_ab171900171959"                                                            
#>  [8] "H18117631_0"                                                                           
#>  [9] "R1_0"                                                                                  
#> [10] "Maa_1240111711220682016_W3367"                                                         
#> [11] "T1_0"                                                                                  
#> [12] "V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059"
#> [13] "H18121405_0"                                                                           
#> [14] "R1_0"                                                                                  
#> [15] "Mab_2467211713110643835_W4500"                                                         
#> [16] "T1_0"                                                                                  
#> [17] "T2_0"                                                                                  
#> [18] "V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359"          
#> [19] "H71811763_0"                                                                           
#> [20] "R1_0"                                                                                  
#> [21] "Maa_5325411210120486554_W2455"                                                         
#> [22] "Mab_5325411210110485554_W2872"                                                         
#> [23] "T1_0"                                                                                  
#> [24] "T2_0"                                                                                  
#> [25] "T3_0"                                                                                  
#> [26] "T4_0"


# write back the updated data
# writeLines(...)
mnist
1

Here's a dplyr solution that uses a left_join(), but otherwise relies exclusively on vectorized operations, which are significantly more efficient than looping over large datasets.

While the code might appear lengthy, this is simply a formatting choice: for the sake of clarity, I use

foo(
  arg_1 = bar,
  arg_2 = baz,
  # ...
  arg_n = qux
) 

rather than the one-liner foo(bar, baz, qux). Also for the sake of clarity, I will elaborate on the line

    # Map each row to its house ID.
    House_id = data[row_number()[target][cumsum(target)]],

in the Details section.

Solution

Given a file like subset.txt reproduced here

H18105265_0
R1_0
Mab_3416311514210525745_W923650.80
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W123650.80
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W923650.80
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W923650.80
Mab_5325411210110485554_W723650.80
T1_0
T2_0
T3_0
T4_0

and a reference dataset like df reproduced here

df <- tibble::tribble(
  ~House_id,   ~id, ~new_weight,
   18105265, "Mab",        4567,
   18117631, "Maa",        3367,
   18121405, "Mab",        4500,
   71811763, "Maa",        2455,
   71811763, "Mab",        2872
)

the following solution

# For manipulating data.
library(dplyr)


# ...
# Code to generate your reference 'df'.
# ...



# Specify the filepath.
text_filepath <- "subset.txt"

# Define the textual pattern for each data item we want, where the relevant
# values are divided into their own capture groups.
regex_house_id <- "(H)(\\d+)(_)(\\d)"
regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"



# Read the textual data (into a dataframe).
data.frame(data = readLines(text_filepath)) %>%

  # Transform the textual data.
  mutate(
    # Target (TRUE) the identifying row (house ID) for each (contiguous) group.
    target = grepl(
      # Use the textual pattern for house IDs.
      pattern = regex_house_id,
      x = data
    ),

    # Map each row to its house ID.
    House_id = data[row_number()[target][cumsum(target)]],

    # Extract the underlying numeric ID from the house ID.
    House_id = gsub(
      pattern = regex_house_id,
      # The numeric ID is in the 2nd capture group.
      replacement = "\\2",
      x = House_id
    ),

    # Treat the numeric ID as a number.
    House_id = as.numeric(House_id),



    # Target (TRUE) the weighted labels.
    target = grepl(
      # Use the textual pattern for weighted labels.
      pattern = regex_weighted_label,
      x = data
    ),

    # Extract the ID from (only) the weighted labels.
    id = if_else(
      target,
      gsub(
        pattern = regex_weighted_label,
        # The ID is in the 1st capture group.
        replacement = "\\1",
        x = data
      ),
      # For any data that is NOT a weighted label, give it a blank (NA) ID.
      as.character(NA)
    ),

    # Extract from (only) the weighted labels everything else but the weight.
    rest = if_else(
      target,
      gsub(
        pattern = regex_weighted_label,
        # Everything is in the 2nd, 3rd, and 4th capture groups; ignoring the ID
        # (1st) and the weight (5th).
        replacement = "\\2\\3\\4",
        x = data
      ),
      # For any data that is NOT a weighted label, make it blank (NA) for
      # everything else.
      as.character(NA)
    )
  ) %>%

  # Link (JOIN) each weighted label to its new weight; with blanks (NAs) for
  # nonmatches.
  left_join(df, by = c("House_id", "id")) %>%

  # Replace (only) the weighted labels, with their updated values.
  mutate(
    data = if_else(
      target,
      # Generate the updated value by splicing together the original components
      # with the new weight.
      paste0(id, rest, new_weight),
      # For data that is NOT a weighted label, leave it unchanged.
      data
    )
  ) %>%

  # Extract the column of updated values.
  .$data %>%

  # Overwrite the original text with the updated values.
  writeLines(con = text_filepath)

will transform your textual data and update the original file.

Result

The original file (here subset.txt) will now contain the updated information:

H18105265_0
R1_0
Mab_3416311514210525745_W4567
T1_0
T2_0
T3_0
V64_0_2_010_ab171900171959
H18117631_0
R1_0
Maa_1240111711220682016_W3367
T1_0
V74_0_1_010_aa081200081259_aa081600081859_aa082100095659_aa095700101159_aa101300105059
H18121405_0
R1_0
Mab_2467211713110643835_W4500
T1_0
T2_0
V62_0_1_010_090500092459_100500101059_101100101659_140700140859_141100141359
H71811763_0
R1_0
Maa_5325411210120486554_W2455
Mab_5325411210110485554_W2872
T1_0
T2_0
T3_0
T4_0

Details

The Regex

The text manipulation relies on simply the base functionality of grepl() (to identify matches) and gsub() (to extract components). We divide each of our textual patterns regex_house_id and regex_weighted_label into their components, as distinguished by capture groups in regular expressions:

#      The "H" prefix.      The "_" separator.
#                  | |      | |
regex_house_id <- "(H)(\\d+)(_)(\\d)"
#                     |    |   |   |
#  The digits following "H".   The "0" suffix (or any digit).
#                                The digits after the 'id'.
#   The 'id': "M" then 2 small letters.   |    |    The weight (possibly a decimal).
#                          |          |   |    |    |              |
regex_weighted_label <-   "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"
#                                      | |      |  |
#                       The "_" separator.      The "_" separator and "W" prefix before weight.

We can use grepl(pattern = regex_weighted_label, x = my_strings) to check which strings in the vector my_strings match the format for a weighted label (like "Mab_3416311514210525745_W923650.80").

We can also use gsub(pattern = regex_weighted_label, replacement = "\\5", my_labels) to extract the weights (the 5th capture group) from a vector my_labels of labels in that format.
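
As a quick illustration (not part of the original workflow), applied to two labels from the sample file:

regex_weighted_label <- "(M[a-z]{2,})(_)(\\d+)(_W)(\\d+(\\.\\d+)?)"

my_labels <- c("Mab_3416311514210525745_W923650.80",
               "Maa_1240111711220682016_W123650.80")

grepl(pattern = regex_weighted_label, x = my_labels)
#> [1] TRUE TRUE

gsub(pattern = regex_weighted_label, replacement = "\\1", x = my_labels)  # the id
#> [1] "Mab" "Maa"

gsub(pattern = regex_weighted_label, replacement = "\\5", x = my_labels)  # the weight
#> [1] "923650.80" "123650.80"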

The Mapping

Found within the first mutate() statement, the line

    # Map each row to its house ID.
    House_id = data[row_number()[target][cumsum(target)]],

might appear mystifying. However, it is simply a classic arithmetic trick (also employed by @mnist in their solution) to index contiguous values as groups.

The code cumsum(target) scans over the target column, which (at this point in the workflow) has logical values (TRUE FALSE FALSE ...) indicating whether (TRUE) or not (FALSE) the text line is a house ID (like "H18105265_0"). When it hits a TRUE (numerically a 1), it increments its running total, while FALSE (numerically a 0) leaves the total unchanged.

Since the textual data column

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
  "H18105265_0" "R1_0" ...                 "H18117631_0" "R1_0" ...           "H18121405_0" ...

has given us the logical target column

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE ...

these values (TRUE and FALSE) are coerced to numeric (1 and 0)

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1    0     0     0     0     0     0     1    0     0     0     0     0     1    0     ...

to yield the cumsum() here:

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1    1     1     1     1     1     1     2    2     2     2     2     2     3    3     ...  

Notice how we've mapped each row to its "group number". So much for cumsum(target).

Now for row_number()[target]! In effect, row_number() simply "indexes" each position (row)

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1             2      ...                 8             9      ...           13         ...

in the data column (or any other column):

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |------------ ...
  "H18105265_0" "R1_0" ...                 "H18117631_0" "R1_0" ...           "H18121405_0" ...

So subscripting those indices by target

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  TRUE           FALSE ...                  TRUE          FALSE ...           TRUE       ...

selects only those positions with house IDs:

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1                                         8                                 13         ...

So if we take that result for row_number()[target]

# House ID: 1st 2nd 3rd ...
# Position:
            1   8   13  ... 

and subscript it by cumsum(target)

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1    1     1     1     1     1     1     2    2     2     2     2     2     3    3     ...

we map each row to the position (in data) of its house ID:

# |-------------- Group 1 ---------------| |----------- Group 2 ------------| |--------- ...
  1    1     1     1     1     1     1     8    8     8     8     8     8     13   13    ...

This is the outcome of row_number()[target][cumsum(target)].

Finally, when we subscript data by these (repeated) positions of its house IDs, we get the House_id column

# |----------------- Group 1 -----------------| |----------------- Group 2 -----------------| |-------------------------- ...
  "H18105265_0" "H18105265_0" ... "H18105265_0" "H18117631_0" "H18117631_0" ... "H18117631_0" "H18121405_0" "H18121405_0" ...

where each value in data is mapped to the house ID of its group.
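
As a quick sanity check, the same trick can be reproduced in plain base R on a toy vector built from a few of the sample lines (outside of mutate(), which(target) plays the role of row_number()[target]):

data   <- c("H18105265_0", "R1_0", "T1_0",
            "H18117631_0", "R1_0",
            "H18121405_0", "T1_0", "T2_0")
target <- grepl("^H", data)

which(target)                         # positions of the house-ID rows
#> [1] 1 4 6
cumsum(target)                        # group number of every row
#> [1] 1 1 1 2 2 3 3 3
data[which(target)[cumsum(target)]]   # house ID repeated for every row in its group
#> [1] "H18105265_0" "H18105265_0" "H18105265_0" "H18117631_0" "H18117631_0"
#> [6] "H18121405_0" "H18121405_0" "H18121405_0"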

Thanks to this House_id column

House_id = data[row_number()[target][cumsum(target)]]

next to our data column, we can map (left_join()) the ids in df to their corresponding textual data.

Greg