1

I would like to remove unnecessary information from a column, with the aim of totaling the amount.

Col_X currently has multiple sections within a whole, split by their 5 digit segment "19651". The segment numbers need to be removed, the Currency needs to be moved to a new column, and the numbers totaled.

So for the last section, it should have two columns reading

Col_X  | CCY  
429.34 | EUR

Col_X  
19651: 10 GBP  
19662: 1.2 GBP  
19663: 6 GBP  
19852: 108.88 EUR  
19926: 147 EUR | 39927: 86.9 EUR | 39928: 49.35 EUR  
19994: 30.95 EUR | 29990: 298.4 EUR | 19996: 99.99 EUR  

I have tried to remove the first section of numbers, but this fails - so i am struggling to work out how to remove them in between the key data i need.

Col_X <- gsub("^.?:","",bill$Col_X)

the first part of the code is failing to remove any numbers from the start.

M--
  • 25,431
  • 8
  • 61
  • 93
HP.
  • 37
  • 5
  • Hey there. I see that you have on the same row more than one segment divided by '|'. Do you need the other segments removed as well?. Removing only the first part this will do the trick `gsub('^\\d*:\\s', '', bill$Col_X)` – Steve Aug 21 '19 at 10:48
  • Hi Steve. For some reason, that code isn't taking anything off the column. Yes ideally i need each section off, is it better doing it one at a time? – HP. Aug 21 '19 at 10:53
  • Could you paste the `dput()` output of your current dataframe object? – Steve Aug 21 '19 at 10:54
  • ``` c("10 GBP", "1.2 GBP", "6 GBP", "108.88 EUR", "147 EUR | 39927: 86.9 EUR | 39928: 49.35 EUR", ``` seems like it is off on dput but not when i head(df) – HP. Aug 21 '19 at 10:59
  • Hey. This is probably not all the `dput()` output. Check [here](https://stackoverflow.com/a/5963610/7856717) for a good explanation/way for how to generate your `dput()` output. – Steve Aug 21 '19 at 11:03
  • `c("10 GBP", "1.2 GBP", "6 GBP", "108.88 EUR", "147 EUR | 39927: 96.4 EUR | 39928: 49.35 EUR", "30.95 EUR | 29990: 298.4 EUR | 19996: 99.99 EUR" )` `dput(head(Col_X,6))` – HP. Aug 21 '19 at 11:18
  • ignore me - it works, i didn't direct it back to the df. How would I then take from the other sections? – HP. Aug 21 '19 at 11:25
  • Glad to hear it:) If you need to split the rest of the segments, try `str_split_fixed(bill$Col_X, ' \\| ', 3)`. Good luck! – Steve Aug 21 '19 at 11:27

1 Answers1

2

You may use something like that: remove all but currency values and sum them up per entry. Then, grab the 3 letter whole word from each. Then, paste them together:

x <-  c("10 GBP", "1.2 GBP", "6 GBP", "108.88 EUR", "147 EUR | 39927: 86.9 EUR | 39928: 49.35 EUR")
amounts <- lapply(x, function(m) sum(scan(text=gsub("\\s*(?:\\d+:|[A-Z]+)\\s*", "", m), sep="|", what = numeric(), quiet=TRUE)))
currency <- sub(".*?\\b([A-Z]{3})\\b.*", "\\1", x)
paste(amounts, currency, sep=" ") 
## => [1] "10 GBP"     "1.2 GBP"    "6 GBP"      "108.88 EUR" "283.25 EUR"

See the R demo online.

The gsub("\\s*(?:\\d+:|[A-Z]+)\\s*", "", m) removes

  • \s* - 0+ whitespaces
  • (?:\d+:|[A-Z]+) - 1+ digits and : after them or 1+ uppercase ASCII letters
  • \s* - 0+ whitespaces

The sub(".*?\\b([A-Z]{3})\\b.*", "\\1", x) "extracts" the first three-letter currency code:

  • .*? - any 0+ chars as few as possible
  • \b([A-Z]{3})\b - Group 1 (\1): a word boundary, 3 ASCII letters, a word boundary
  • .* - any 0+ chars as many as possible
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    This is amazing. Solved the entire problem in one go. greatly appreciate the breakdown of your workings. I am going to analyse this offline! – HP. Aug 21 '19 at 11:49