0

I'm working with translation data and am trying to distinguish between simple typos and actual text modifications; typos are identified as those alterations that do not occur within 7 keystrokes after or before another alteration or broken word (longer pause within a word; those I've managed to identify). Ideally, the code would also check if any other alterations or broken words occur in the same word, regardless of how many keystrokes that is removed from the present one. The variable Issue takes the values 'BW' for broken word, 'ALT' for alterations, and 'none' if new text is produced smoothly. Each 'Id' represents a keystroke. 'Count' keeps track of the word count, i.e. all keystrokes contributing to the first word are labeled 1, the second word 2, etc.

I'd like to divvy up the 'ALT' group into 'ALT' and 'Typo' by determining for each 'ALT' if another issue ('ALT' or 'BW') pops up within 7 keystrokes before or after it, or within the same word, regardless of how long that word is (i.e. for each 'Id' with the same 'Count'). If this is not the case, the 'ALT' should be considered a 'Typo'.

I've tried a bunch of nested for and if statements, but these tend to get problematic or claim that arguments 'have length 0' when this is not the case; I'm not that skilled a coder to actually get them to work. In my final attempt, below, I'd settled for any issues within 7 keystrokes before or after the 'ALT', and not taken into account the word itself, although that is not ideal.

Example dataset:

T01 <- structure(list(Id = 1:100, Count = c(1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 10L), Issue = c("none", 
"none", "none", "none", "none", "none", "BW", "none", "ALT", 
"ALT", "ALT", "BW", "none", "none", "none", "none", "none", "none", 
"none", "none", "none", "none", "none", "none", "none", "ALT", 
"none", "none", "none", "none", "none", "none", "none", "none", 
"none", "none", "none", "none", "none", "BW", "ALT", "ALT", "ALT", 
"ALT", "ALT", "ALT", "BW", "BW", "none", "BW", "ALT", "ALT", 
"BW", "none", "none", "none", "none", "none", "none", "none", 
"none", "none", "none", "none", "none", "none", "none", "none", 
"none", "ALT", "none", "none", "none", "ALT", "BW", "none", "BW", 
"ALT", "ALT", "ALT", "BW", "BW", "none", "none", "BW", "none", 
"none", "none", "none", "none", "none", "none", "none", "none", 
"none", "none", "none", "none", "none", "none")), row.names = c(NA, 
100L), class = "data.frame")

Final coding attempt:

library(dplyr)

Changes <- which(T01$Issue == 'Alt') # Identify which issues are alterations, and therefore possibly typos
T01$Typo <- F           #Create a column in the dataframe to store values
for (index in 1:length(T01$Id)) {     # i.e. for each row
for (x in 1:7) {                      # check up to 7 rows before or after this one
if (T01$Id[index] %in% Changes) {     # Only check indices where T01$Issue == 'ALT'
if ((T01$Issue[index-x] != 'ALT') & (T01$Issue[index-x] != 'BW') & 
# no alterations or broken words up to seven keystrokes before this one
(T01$Issue[index+x] != 'ALT') & (T01$Issue[index+x] != 'BW')) 
# no alterations or broken words up to seven keystrokes after this one
{T01$Typo[index] <- T}}}}

Hoping someone here can help out!

Michiel
  • 3
  • 3
  • Welcome to StackOverflow. Please read these guidelines and edit your question to improve it: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example I have read the description three times and still don't understand what you're trying to do, other than split ALT into ALT or Typo, but the conditions for doing so are still a mystery. Please take your code and translate it to English. Relate the ALT and Type and BW more clearly to the data in data frame. Per the guidelines I linked, make edits to your post, don't add information as comments. – John Polo Feb 24 '23 at 02:34
  • Made some edits, hope this helps clear things up! – Michiel Feb 24 '23 at 08:25

1 Answers1

1

We can do it with group_by(Count) and zoo::rollapply().

rollapply creates a window and apply a function, partial = TRUE is to consider incomplete windows, ie there isn't 7 observations before

library(dplyr)

T01 %>% 
  arrange(Id, Count) %>%
  group_by(Count) %>% 
  mutate(teste = ifelse(Issue == "ALT", 
                        ifelse(
                          zoo::rollapply(Issue,
                                         width = list(c(-7:-1,1:7)), 
                                         \(x) any(x == "ALT" | x == "BW"),
                                         partial = TRUE),
                          "Typo","ALT"),
                        Issue
                        )
         ) 
#> # A tibble: 100 × 4
#> # Groups:   Count [10]
#>       Id Count Issue teste
#>    <int> <int> <chr> <chr>
#>  1     1     1 none  none 
#>  2     2     1 none  none 
#>  3     3     1 none  none 
#>  4     4     1 none  none 
#>  5     5     1 none  none 
#>  6     6     2 none  none 
#>  7     7     2 BW    BW   
#>  8     8     2 none  none 
#>  9     9     2 ALT   Typo 
#> 10    10     2 ALT   Typo 
#> # … with 90 more rows

Created on 2023-02-24 with reprex v2.0.2

M Aurélio
  • 830
  • 5
  • 13