I have a DataFrame in which one column contains strings that look like:

Received value 126;AOC;H3498XX from 602
Received value 101;KYL;0IMMM0432 from 229

I want to drop (or replace with nothing) the part after the second semicolon so that it looks like:

Received value 126;AOC; from 602

But the part I want to drop has a varying and unpredictable length (always a combination of A-Z and 0-9). The semicolons and the word "from" will always be there for reference.

I'm trying to use regex by studying this link: https://docs.python.org/3/library/re.html

import re
for row in df['column']:
    row = re.sub(';[A-Z0-9] from', '; from', row)

I think the [A-Z0-9] fails to incorporate the varying length aspect I want.

Theo
Eric N.

2 Answers

An example using str.replace() with str.split():

s = ['126;AOC;H3498XX from 602', '101;KYL;0IMMM0432 from 229']

for elem in s:
    print(elem.replace(elem.split(";",2)[-1].split()[0],''))

OUTPUT:

126;AOC; from 602
101;KYL; from 229

EDIT:

The same would work with the following example as well:

s = ['Received value 126;AOC;H3498XX from 602', 'Received value 101;KYL;0IMMM0432 from 229']

for elem in s:
    print(elem.replace(elem.split(";",2)[-1].split()[0],''))

OUTPUT:

Received value 126;AOC; from 602
Received value 101;KYL; from 229
DirtyBit
  • This works perfectly when I use print(), but I want the output to remain in the column of my data frame. When I try for elem in s: s['column'] = elem.replace.... it doesn't give me the expected output. Do you know how to keep the output within the column of that dataframe? – Eric N. Apr 11 '19 at 12:16
  • @EricN. you could iterate through the desired rows and replace the values: https://stackoverflow.com/questions/25478528/updating-value-in-iterrow-for-pandas – DirtyBit Apr 11 '19 at 12:17
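
To keep the result in the DataFrame column without iterating over rows, one option is the vectorized Series.str.replace. The following is a minimal sketch rather than the answerer's method: the DataFrame is a hypothetical stand-in built from the two sample strings, and the column name 'column' is taken from the question. The pattern assumes the code to drop always sits between a semicolon and " from".

```python
import pandas as pd

# Hypothetical DataFrame built from the question's sample rows;
# the column name 'column' follows the question.
df = pd.DataFrame({
    "column": [
        "Received value 126;AOC;H3498XX from 602",
        "Received value 101;KYL;0IMMM0432 from 229",
    ]
})

# '+' lets the character class match one or more characters, so codes of
# any length are removed. regex=True is passed explicitly because the
# default for this argument has differed between pandas versions.
df["column"] = df["column"].str.replace(r";[A-Z0-9]+ from", "; from", regex=True)

print(df["column"].tolist())
```

Assigning the result back to df["column"] is what makes the change persist; rebinding a loop variable, as in the question's for loop, leaves the DataFrame untouched.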

Use pattern (Received value \d+;[A-Z]+;)\w+(\s.*?)

Ex:

import re

s = ["Received value 126;AOC;H3498XX from 602", "Received value 101;KYL;0IMMM0432 from 229"]

for i in s:
    print(re.sub(r"(Received value \d+;[A-Z]+;)\w+(\s.*?)", r"\1\2", i))

Output:

Received value 126;AOC; from 602
Received value 101;KYL; from 229
Rakesh
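
The pattern from the question can also be made to work with a one-character change. The sketch below, which uses only the sample strings from the question, compares the original pattern with a version that adds the + quantifier (one or more repetitions); the variable names are illustrative.

```python
import re

rows = [
    "Received value 126;AOC;H3498XX from 602",
    "Received value 101;KYL;0IMMM0432 from 229",
]

# Without a quantifier, [A-Z0-9] matches exactly one character, so the
# pattern never matches these rows and re.sub returns them unchanged.
unchanged = [re.sub(r";[A-Z0-9] from", "; from", row) for row in rows]

# With '+', the class matches a run of one or more characters, which
# covers codes of any length.
cleaned = [re.sub(r";[A-Z0-9]+ from", "; from", row) for row in rows]

print(unchanged == rows)
for row in cleaned:
    print(row)
```

Because the pattern requires " from" immediately after the run of characters, it skips the first semicolon (where "AOC" is followed by another semicolon) and matches only at the second one.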