I have a field which contains two characters, some digits and potentially a single letter. For example:

QU1Y
ZL002
FX16
TD8
BF007P
VV1395
HM18743
JK0001

I would like to return all letters in their original positions, but normalise the digits as follows.

For 1 to 3 digits: return the digits, left-padded with zeros to three digits where needed.

For 4 or more digits: if the first digit is not a zero, return the first 4 digits; if the first digit is a zero, truncate to three digits (so JK0001 becomes JK001).

Expected output for the data above:

QU001Y
ZL002
FX016
TD008
BF007P
VV1395
HM1874
JK001

The implementation will be in R, but I'm interested in a pure regex solution; I'll work out the R side of things. It may not be possible in pure regex, which is why I can't get my head round it.

This pattern identifies the values that are already correct, but I'm hoping to fix those which are not:

"[A-Z]{2}[1-9]{0,1}[0-9]{1,3}[F,Y,P]{0,1}"

For the curious, they are flight numbers but entered by a human. Hence the variety...
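
A minimal sketch of how the target format described above could be checked in base R. The helper name `is_valid_flight` and the sample vector `flights` are hypothetical, and the pattern is a tightened variant rather than the one quoted above: it is anchored with `^` and `$`, it accepts any single uppercase suffix letter (an assumption; the quoted pattern only allows F, Y or P), and it avoids commas inside the character class, since there they match a literal comma.

```r
# Hypothetical helper: TRUE when a value already has the target shape,
# i.e. two letters, then either three digits or four digits that do not
# start with zero, then an optional single suffix letter.
is_valid_flight <- function(x) {
  pattern <- "^[A-Z]{2}(?:[0-9]{3}|[1-9][0-9]{3})[A-Z]?$"
  # grepl() returns FALSE for NA input, so missing values count as invalid
  grepl(pattern, as.character(x), perl = TRUE)
}

flights <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
flights[!is_valid_flight(flights)]  # values that still need correcting
```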

rj3838
  • You won't be able to solve it without a bit of code. Use `gsubfn` once you are sure you know the right pattern to match the strings where modification is required. – Wiktor Stribiżew Oct 08 '18 at 12:47
  • If the first two letters must exist, use `gsubfn('^[A-Z]{2}\\K0*(\\d{1,4})\\d*', ~ sprintf("%03d",as.numeric(x)), l, perl=TRUE)` – Wiktor Stribiżew Oct 08 '18 at 13:30
  • Please see [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Wiktor Stribiżew Oct 09 '18 at 08:41
  • `fred <- gsubfn("^[A-Z]{2}\\K0*(\\d{1,4})\\d*", ~ sprintf("%03d", as.numeric(x)), preactorDF[["Flight No"]], perl = TRUE)` gives `Error: is.character(x) is not TRUE` – rj3838 Oct 09 '18 at 08:46
  • Please update the post with what your `preactorDF[["Flight No"]]` is, use `dput`. – Wiktor Stribiżew Oct 09 '18 at 08:49
  • The preactorDF has 1.3 million records. dput would be massive. – rj3838 Oct 09 '18 at 09:02
  • Try `dput(head(preactorDF,10))` – Wiktor Stribiżew Oct 09 '18 at 09:06
  • `dput(head(preactorDF[["Flight No"]],10))` gives `c("BA038", "BA038", "BA247", "BA247", "BA198", "BA238", "BA238", "BA057", "BA199", "BA199")`, but this is a small selection and very tidy. Other rows contain NA, which may be the problem. – rj3838 Oct 09 '18 at 09:18
  • I cannot help more if you keep the data to yourself. The comment above does not help. Please provide an [MCVE](https://stackoverflow.com/help/mcve). – Wiktor Stribiżew Oct 09 '18 at 09:21
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/181538/discussion-between-rj3838-and-wiktor-stribizew). – rj3838 Oct 09 '18 at 09:28

1 Answer

You may use

> library(gsubfn)
> l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
> gsubfn('^[A-Z]{2}\\K0*(\\d{1,4})\\d*', ~ sprintf("%03d",as.numeric(x)), l, perl=TRUE)
[1] "QU001Y" "ZL002"  "FX016"  "TD008"  "BF007P" "VV1395" "HM1874" "JK001" 

The pattern matches

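  • ^ - start of string
  • [A-Z]{2} - two uppercase letters
  • \\K - match reset operator (needs perl=TRUE): the text matched so far is dropped from the match, so the two letters are left untouched
  • 0* - zero or more leading zeros (discarded)
  • (\\d{1,4}) - Capturing group 1: one to four digits
  • \\d* - zero or more further digits (discarded).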

Group 1 is passed to the callback function, where sprintf("%03d",as.numeric(x)) left-pads the number with zeros to at least three digits; a four-digit value is returned as is.
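
For comparison, the same idea can be sketched in base R without `gsubfn`. This is an alternative, not the method used in this answer: the helper name `normalise_flight` is hypothetical, the pattern adds capture groups for the prefix and the suffix instead of using `\\K`, and the `as.character()` coercion and the untouched `NA` values are assumptions about how non-character or missing input should be treated.

```r
# Hypothetical helper (base R only): normalise the digit run of a flight number.
normalise_flight <- function(x) {
  x <- as.character(x)  # assumption: factors etc. are coerced to strings
  # Group 1: two-letter prefix
  # Group 2: up to four digits after any leading zeros
  # Group 3: whatever follows the digit run (e.g. a suffix letter)
  pattern <- "^([A-Z]{2})0*([0-9]{1,4})[0-9]*(.*)$"
  ok <- !is.na(x) & grepl(pattern, x, perl = TRUE)

  prefix <- sub(pattern, "\\1", x[ok], perl = TRUE)
  digits <- sub(pattern, "\\2", x[ok], perl = TRUE)
  suffix <- sub(pattern, "\\3", x[ok], perl = TRUE)

  # Left-pad to at least three digits; four-digit values pass through as is
  x[ok] <- paste0(prefix, sprintf("%03d", as.integer(digits)), suffix)
  x
}

l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
normalise_flight(l)
```

Values that do not match the pattern, including `NA`, are returned unchanged.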

Wiktor Stribiżew
  • @rj3838 Provide your input data in the question body. – Wiktor Stribiżew Oct 09 '18 at 08:41
  • I get an error: `fred <- gsubfn("^[A-Z]{2}\\K0*(\\d{1,4})\\d*", ~ sprintf("%03d", as.numeric(x)), preactorDF[["Flight No"]], perl = TRUE)` gives `Error: is.character(x) is not TRUE`. My source is 1324156 rows, so a clue as to why may help. – rj3838 Oct 09 '18 at 08:50
  • @rj3838 I cannot help you with that since you have not provided reproducible code; the solution above works well with the strings you posted. – Wiktor Stribiżew Oct 09 '18 at 08:51