
I'm struggling with slicing. I thought it was generally easy and that I understood it, but in the situation below my ideas don't work.

Situation: In one of the columns of my DataFrame I want to remove, from every row, a string that sometimes occurs and sometimes doesn't.

The problem looks like this:

1. I don't know the exact position where this string starts (it could be different in each row).

2. The string varies from row to row, but it always starts with the same structure, let's say: "¯main_".

3. After "¯main_" there are usually some digits. They vary, but the length is always the same (9 digits).

4. I have already split the data and have around 40 columns (each with a similar problem). That's why I'm looking for a more efficient way to solve it than splitting, generating ~40 more columns and then dropping them.

5. Sometimes after the string with "¯main_" there's some additional string that I'd like to keep in the same column.

Example:

Column1
A1-19
B2-52
C3-1245¯main_123456789
D4
Z89028
F7¯main_123456789,Z241

Looking for a result like this:

Column1
A1-19
B2-52
C3-1245
D4
Z89028
F7,Z241

The best solution that I prepared up till now:

a = test.find("¯")
b = a+14
df[0].str.slice(start = a, stop = b)

But:

1. It doesn't work properly.

2. I'm aware that test.find() returns -1 when it doesn't find the character. I don't know how to get around that. By writing a loop? I believe a better (more efficient) solution exists. However, after a few hours of looking for it, I decided to ask for help.
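
For illustration, here is a minimal sketch of what the attempt above runs into. It assumes pandas is installed and uses a made-up Series `s` built from the example rows (the names `s`, `pos`, `token_len` and `kept` are hypothetical). It shows that the marker position differs per row, that the token is 15 characters long rather than 14, and that `str.slice` keeps the selected span instead of removing it.

```python
import pandas as pd

# Hypothetical sample built from the rows in the question.
s = pd.Series(["A1-19", "C3-1245¯main_123456789", "F7¯main_123456789,Z241"])

# Position of the marker in each row; -1 where the marker is absent.
pos = s.str.find("¯")
print(pos.tolist())  # [-1, 7, 2]

# "¯main_" is 6 characters and is followed by 9 digits: 15 in total.
token_len = len("¯main_") + 9

# str.slice returns the selected span, so this extracts text
# rather than deleting it, and one fixed start cannot fit every row.
kept = s.str.slice(start=7, stop=7 + token_len)
print(kept.tolist())
```

So a single pair of positions computed from one test string cannot be reused for the whole column.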

QbS

1 Answer

Loop over all columns, cut each string at the position of the marker, append the cleaned strings to a helper list, and finally assign the list back to the column:

print (df)
                   Column1
0                      NaN
1                    B2-52
2  C3-1245¯main_123456789
3                       D4
4                   Z89028
5  F7¯main_123456789,Z241

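for c in df.columns:
    out = []
    for x in df[c]:
        # NaN != NaN, so this check skips missing values
        if x == x:
            p = x.find('¯')
            if p != -1:
                # drop '¯main_' (6 characters) plus the 9 digits after it: 15 characters
                out.append(x[:p] + x[p+15:])
            else:
                out.append(x)
        else:
            out.append(x)
    df[c] = out

print (df)
   Column1
0      NaN
1    B2-52
2  C3-1245
3       D4
4   Z89028
5  F7,Z241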
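
As a possible alternative to the explicit loop, the same cleanup can be sketched with a vectorized regular-expression replace. This is only a sketch under assumptions: the token is always "¯main_" followed by exactly nine digits, and every column holds strings or missing values. The names `pattern` and `cleaned` are made up for the example.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the one printed above.
df = pd.DataFrame({"Column1": [np.nan, "B2-52", "C3-1245¯main_123456789",
                               "D4", "Z89028", "F7¯main_123456789,Z241"]})

# Assumed token shape: the marker followed by exactly nine digits.
pattern = r"¯main_\d{9}"

cleaned = df.copy()
for c in cleaned.columns:
    # .str.replace leaves missing values untouched, so no NaN check is needed.
    cleaned[c] = cleaned[c].str.replace(pattern, "", regex=True)

print(cleaned)
```

Because the pattern matches the token wherever it sits, no position arithmetic is needed, and any text after the token (such as ",Z241") stays in the column.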
jezrael
  • Thanks jezrael, you're right, it works. However, I should have mentioned that I have already split the data and have around 40 columns with the described situation. That's why I'm looking for a more efficient way, maybe something that will spare me generating 40 more columns and then dropping them. I'll add it to my question. Sorry for the lack of clarity. – QbS Jul 20 '19 at 11:50
  • @KubaS - No problem. Do you need to apply the solution to all columns? – jezrael Jul 20 '19 at 11:52
  • Yes, and sometimes after this string with "¯main_" there's some additional string I'd like to keep in the same column. – QbS Jul 20 '19 at 11:55
  • @KubaS - Can you check my solution? It loops over all columns and applies the solution. – jezrael Jul 20 '19 at 12:07
  • Looks like it should work, and I understand the logic behind it. Thanks! However, I don't know why I get AttributeError: 'float' object has no attribute 'find' referring to p = x.find('¯'). Do you have any idea what I should check? I checked the type of the DF (pandas.core.frame.DataFrame) and of the columns (object) and don't get why it's like that. – QbS Jul 20 '19 at 12:34
  • @KubaS - I think there are missing values. – jezrael Jul 20 '19 at 12:36
  • It works. I just needed to change all None values to nan. For anyone else who faces the same difficulty: df.fillna(value=pd.np.nan, inplace=True). To sum up, a thousand thanks! You saved me a lot of time, and I learned something new today! – QbS Jul 20 '19 at 20:19
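
Regarding the missing-value errors discussed in the comments, one way to make the per-value logic tolerate both None and NaN is to guard on the type instead of converting values first. This is a small sketch rather than the answer's exact code: `strip_token` and `TOKEN_LEN` are hypothetical names, and it assumes the token is "¯main_" plus nine digits.

```python
# Assumed token length: "¯main_" (6 characters) plus 9 digits.
TOKEN_LEN = len("¯main_") + 9

def strip_token(x):
    """Remove the token from a string; pass non-strings through unchanged."""
    if not isinstance(x, str):
        return x  # None, NaN and other non-string values are left as they are
    p = x.find("¯")
    if p == -1:
        return x  # marker not present, nothing to remove
    return x[:p] + x[p + TOKEN_LEN:]

values = [None, float("nan"), "D4", "F7¯main_123456789,Z241"]
result = [strip_token(v) for v in values]
print(result)
```

With a helper like this, the inner loop of the answer could call `strip_token(x)` for every value without a separate check for missing data.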