1

I have the following strings in a pandas DataFrame in Python3, column string1 and string2:

import pandas as pd

datainput = [
    { 'string1': 'TTTABCDABCDTTTTT', 'string2': 'ABABABABABABABAA' },
    { 'string1': 'AAAAAAAA', 'string2': 'TTAAAATT' },
    { 'string1': 'TTABCDTTTTT', 'string2': 'ABABABABABA' }
]

df = pd.DataFrame(datainput)

df
            string1           string2
0  TTTABCDABCDTTTTT  ABABABABABABABAA
1          AAAAAAAA          TTAAAATT
2       TTABCDTTTTT       ABABABABABA

For each row, strings in columns string1 and string2 are defined to be the same length.

For each row of the DataFrame, the strings may need to be "cleaned" of beginning/trailing letters 'T'. However, for each row, the strings need to both be stripped of the same number of characters, so as the strings remain the same length.

The correct output is as follows:

df
            string1           string2
0          ABCDABCD      BABABABA
1          AAAA          AAAA
2          ABCD          ABAB

If these were two variables, it would be straightforward to calculate this with strip(), e.g.

string1 = "TTTABCDABCDTTTTT"
string2 = "ABABABABABABABAA"

length_original = len(string1)
num_left_chars = len(string1) - len(string1.lstrip('T'))
num_right_chars = len(string1.rstrip('T'))
edited = string1[num_left_chars:num_right_chars]
## print(edited)
## 'ABCDABCD'

However, in this case, one needs to iterate through all rows and redefine two rows at once. How could one modify each these strings row by row?

EDIT: My main confusion is, given both columns could T, how do I re-define them both?

ShanZhengYang
  • 16,511
  • 49
  • 132
  • 234

2 Answers2

1
raw_data = {'name': ['Will Morris', 'Alferd Hitcock', 'Sir William', 'Daniel Thomas'],
                'age': [11, 49, 66, 77],
                'color': ['TblueT', 'redT', 'white', "cyan"],
                'marks': [74, 90, 44, 17]}
df = pd.DataFrame(raw_data, columns = ['name', 'age', 'color', 'grade'])
print(df)
cols =  ['name','color']
print("new df")
#following line does the magic 

df[cols] = df[cols].apply(lambda row: row.str.lstrip('T').str.rstrip('T'), axis=1)
print(df)

Will print

               name  age   color  grade
0  TWillard MorrisT   20  TblueT     88
1       Al Jennings   19    redT     92
2      Omar Mullins   22  yellow     95
3  Spencer McDaniel   21   green     70

new df

               name  age   color  grade
0    Willard Morris   20    blue     88
1       Al Jennings   19     red     92
2      Omar Mullins   22  yellow     95
3  Spencer McDaniel   21   green     70
stormfield
  • 1,696
  • 1
  • 14
  • 26
  • Thanks for the help. But I need to modify both columns, as both columns could have `T`. – ShanZhengYang Oct 01 '18 at 02:39
  • @ShanZhengYang I have changed the code now and it works as intended. Please mark the answer as correct . – stormfield Oct 01 '18 at 03:07
  • I think there's still some confusion---I'm happy to edit my question. The strings in both columns are equal length strings. If you remove N number of characters in one string, you must remove N number of character in the other string. – ShanZhengYang Oct 01 '18 at 16:03
  • I was very sleepy when i wrote the original answer. I am not quite sure if there is a easy way to do this. The answer written by Yoshi Hammer should work if you are looking for a hack. I will try to give you a better version soon. – stormfield Oct 01 '18 at 16:09
  • Thanks! I appreciate the help if you're able to get something more efficient/quicker. – ShanZhengYang Oct 01 '18 at 17:35
1

A bit lengthy but gets the job done..

import re
def count_head(s):
    head = re.findall('^T+', s)
    if head:
        return len(head[0])
    return 0
def count_tail(s):
    tail = re.findall('T+$', s)
    if tail:
        return len(tail[0])
    return 0
df1 = df.copy()
df1['st1_head'] = df1['string1'].apply(count_head)
df1['st2_head'] = df1['string2'].apply(count_head)
df1['st1_tail'] = df1['string1'].apply(count_tail)
df1['st2_tail'] = df1['string2'].apply(count_tail)
df1['length'] = df1['string1'].str.len()

def trim_strings(row):
    head = max(row['st1_head'], row['st2_head'])
    tail = max(row['st1_tail'], row['st2_tail'])
    l = row['length']
    return {'string1': row['string1'][head:(l-tail)],
           'string2': row['string2'][head:(l-tail)]}
new_df = pd.DataFrame(list(df1.apply(trim_strings, axis=1)))
print(new_df)

output:

    string1   string2
0  ABCDABCD  BABABABA
1      AAAA      AAAA
2      ABCD      ABAB

A more compact version:

def trim(st1, st2):
    l = len(st1)
    head = max(len(st1) - len(st1.lstrip('T')), 
              len(st2) - len(st2.lstrip('T')))
    tail = max(len(st1) - len(st1.rstrip('T')), 
              len(st2) - len(st2.rstrip('T')))
    return (st1[head:(l-tail)],
           st2[head:(l-tail)])

new_df = pd.DataFrame(list(
    df.apply(lambda r: trim(r['string1'], r['string2']), 
         axis=1)), columns=['string1', 'string2'])
print(new_df)

The main thing to notice is the df.apply(<your function>, axis=1), which lets you do any function (in this case acting on both columns at once), on each row.

Yosi Hammer
  • 588
  • 2
  • 8
  • This is very intelligent, and answers my question. We create a new column for the count of each string, and then we use this to manipulate the new columns. – ShanZhengYang Oct 01 '18 at 16:05