I have a problem using R in RStudio, though I guess it applies to R in general. I have 2 columns from which I need to extract part of the data and fill a new column based on the original values, one new column for each. I have been trying to figure it out by myself for the past 8 hours and I am stuck.

The first column is titled "Record", with example data such as A12DE48, W8DE769, B97AB99, S29VV02Y, and D684SV2229. In this data the alphabetic units in the middle are the important ones, and I have a list of all of them: AB, AN, BU, DE, IK, LS, SV, EEQ, JFS, and PHT. I wish to extract those units into a new column "Item Type" so that I can run the model on my dataset, as they are possibly good indicators. Is there a method that extracts only the values that correspond to a list I define? I want to match only the units found in the list rather than any letters in general. That is, I want the rule to be: extract one of AB, AN, BU, DE, IK, LS, SV, EEQ, JFS, or PHT if it has at least 1 character before it and 1 character after it, regardless of whether those characters are numbers, letters, or special characters.

The other column has a similar situation. This column, "Item Source", has data points like A134, B223, C111, C2134, D2, E58, and T (yes, that one is just T). The main point is that the initial letter relates to a set warehouse location, and I need those letters. The twist is that for a huge number of rows, multiple sources exist in a single entry, such as "C111 D207 A965", while many other entries are empty. How can I create the new column here while replacing entries that have multiple sources with the text "multiple source" and using "unknown" for the missing ones?

Any help would be appreciated, since this time I am only allowed to use R, which I'm not too familiar with yet, especially coming from Java.

Ziabytes
    Please show a small [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and expected output based on that input data – akrun Dec 11 '15 at 04:25

1 Answer

For the sake of argument let's say you have this dataset:

d = data.frame(record=c("A12DE48","W8DE769","B97AB99","D684SV2229"),
               source=c("A134", "", "T", "C111 D207 A965"),
               stringsAsFactors=FALSE)

For the first column, you can simply use regexps to pick the last group of letters that sits between digits in the string, e.g.:

d$short = gsub(".*\\d+([A-Z]+)\\d+$", "\\1", d$record)
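
The pattern above expects digits after the letter group, so a record that ends in a letter would come back unchanged. As a rough sketch of an alternative, assuming the list of codes given in the question, one could match only those codes and require at least one character on each side. The names `codes`, `records`, and `item_type` are illustrative, and returning NA for records without a listed code is an assumption, not something the question specifies:

```r
# Known item-type codes, taken from the question
codes <- c("AB", "AN", "BU", "DE", "IK", "LS", "SV", "EEQ", "JFS", "PHT")

# Example records from the question
records <- c("A12DE48", "W8DE769", "B97AB99", "S29VV02Y", "D684SV2229")

# At least one character before the code (lazy, so the first listed
# code found is captured) and at least one character after it
pattern <- paste0("^.+?(", paste(codes, collapse = "|"), ").+$")

# Assumption: records that contain no listed code become NA
item_type <- ifelse(grepl(pattern, records, perl = TRUE),
                    sub(pattern, "\\1", records, perl = TRUE),
                    NA_character_)
```

Note that "VV" is not in the list from the question, so "S29VV02Y" ends up as NA under this sketch; adding "VV" to `codes` would make it match.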

For processing the source column you can simply remove everything but the letters:

d$source2 = gsub("[^A-Z]+","",d$source)

leading to

      record         source short source2
1    A12DE48           A134    DE       A
2    W8DE769                   DE        
3    B97AB99              T    AB       T
4 D684SV2229 C111 D207 A965    SV     CDA

Now you can decide what to do with the multiple sources - either keep them all or replace them such as:

d$source2[nchar(d$source2) > 1] = "multiple sources"
d$source2[nchar(d$source2) == 0] = "empty"

The end result:

      record         source short          source2
1    A12DE48           A134    DE                A
2    W8DE769                   DE            empty
3    B97AB99              T    AB                T
4 D684SV2229 C111 D207 A965    SV multiple sources
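
Counting letters works for the sample data, but it would also flag a single source whose prefix has more than one letter. A minimal sketch that counts whitespace-separated entries instead, and uses the labels asked for in the question ("multiple source" and "unknown"), might look like the following. The variable names are illustrative, and treating NA the same as an empty string is an assumption:

```r
# Example entries, including an empty string and a missing value
item_source <- c("A134", "", "T", "C111 D207 A965", NA)

# Treat NA like an empty entry and drop surrounding whitespace
cleaned <- trimws(ifelse(is.na(item_source), "", item_source))

# Number of whitespace-separated sources in each entry (0 for empty)
n_sources <- lengths(strsplit(cleaned, "\\s+"))

# Leading letters of the entry, i.e. the warehouse prefix
prefix <- sub("^([A-Za-z]+).*$", "\\1", cleaned)

warehouse <- ifelse(n_sources == 0, "unknown",
             ifelse(n_sources > 1, "multiple source", prefix))
```

Here `trimws()` and `lengths()` require R 3.2.0 or later.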
Simon Urbanek
  • Thank you very much. While I did say they were in the MIDDLE, it was a crucial mistake on my part not to include an example where an entry in "record" ENDS with a letter; here is such an example: "S29VV02Y". I am very sorry for bothering you with this. However, thank you very much for the second half. – Ziabytes Dec 11 '15 at 05:33