Remove & sum within column

Question

I would like to remove unnecessary information from a column, with the aim of totaling the amount.

Col_X currently has multiple sections within a whole, split by their 5 digit segment "19651". The segment numbers need to be removed, the Currency needs to be moved to a new column, and the numbers totaled.

So for the last section, it should have two columns reading

Col_X  | CCY  
429.34 | EUR

Col_X  
19651: 10 GBP  
19662: 1.2 GBP  
19663: 6 GBP  
19852: 108.88 EUR  
19926: 147 EUR | 39927: 86.9 EUR | 39928: 49.35 EUR  
19994: 30.95 EUR | 29990: 298.4 EUR | 19996: 99.99 EUR

I have tried to remove the first section of numbers, but this fails - so i am struggling to work out how to remove them in between the key data i need.

Col_X <- gsub("^.?:","",bill$Col_X)

the first part of the code is failing to remove any numbers from the start.

Hey there. I see that you have on the same row more than one segment divided by '|'. Do you need the other segments removed as well?. Removing only the first part this will do the trick `gsub('^\\d*:\\s', '', bill$Col_X)` — Steve, Aug 21 '19 at 10:48
Hi Steve. For some reason, that code isn't taking anything off the column. Yes ideally i need each section off, is it better doing it one at a time? — HP., Aug 21 '19 at 10:53
Could you paste the `dput()` output of your current dataframe object? — Steve, Aug 21 '19 at 10:54
``` c("10 GBP", "1.2 GBP", "6 GBP", "108.88 EUR", "147 EUR | 39927: 86.9 EUR | 39928: 49.35 EUR", ``` seems like it is off on dput but not when i head(df) — HP., Aug 21 '19 at 10:59
Hey. This is probably not all the `dput()` output. Check [here](https://stackoverflow.com/a/5963610/7856717) for a good explanation/way for how to generate your `dput()` output. — Steve, Aug 21 '19 at 11:03
`c("10 GBP", "1.2 GBP", "6 GBP", "108.88 EUR", "147 EUR | 39927: 96.4 EUR | 39928: 49.35 EUR", "30.95 EUR | 29990: 298.4 EUR | 19996: 99.99 EUR" )` `dput(head(Col_X,6))` — HP., Aug 21 '19 at 11:18
ignore me - it works, i didn't direct it back to the df. How would I then take from the other sections? — HP., Aug 21 '19 at 11:25
Glad to hear it:) If you need to split the rest of the segments, try `str_split_fixed(bill$Col_X, ' \\| ', 3)`. Good luck! — Steve, Aug 21 '19 at 11:27

score 2 · Accepted Answer · answered Aug 21 '19 at 11:30

You may use something like that: remove all but currency values and sum them up per entry. Then, grab the 3 letter whole word from each. Then, paste them together:

x <-  c("10 GBP", "1.2 GBP", "6 GBP", "108.88 EUR", "147 EUR | 39927: 86.9 EUR | 39928: 49.35 EUR")
amounts <- lapply(x, function(m) sum(scan(text=gsub("\\s*(?:\\d+:|[A-Z]+)\\s*", "", m), sep="|", what = numeric(), quiet=TRUE)))
currency <- sub(".*?\\b([A-Z]{3})\\b.*", "\\1", x)
paste(amounts, currency, sep=" ") 
## => [1] "10 GBP"     "1.2 GBP"    "6 GBP"      "108.88 EUR" "283.25 EUR"

See the R demo online.

The gsub("\\s*(?:\\d+:|[A-Z]+)\\s*", "", m) removes

\s* - 0+ whitespaces
(?:\d+:|[A-Z]+) - 1+ digits and : after them or 1+ uppercase ASCII letters
\s* - 0+ whitespaces

The sub(".*?\\b([A-Z]{3})\\b.*", "\\1", x) "extracts" the first three-letter currency code:

.*? - any 0+ chars as few as possible
\b([A-Z]{3})\b - Group 1 (\1): a word boundary, 3 ASCII letters, a word boundary
.* - any 0+ chars as many as possible

This is amazing. Solved the entire problem in one go. greatly appreciate the breakdown of your workings. I am going to analyse this offline! — HP., Aug 21 '19 at 11:49

Remove & sum within column

1 Answers1