0

So I have a dataframe with 6 columns. "Tags" is supposed to only have 6 characters, but sometimes the API I am pulling from likes to mess things up. It looks like:

import pandas as pd
df = pd.DataFrame({'user': ['Ticket ID', 'Closed Time', 'Tags'], 
                   'income': [1, 2, 3, ],
                   'Closed Time': ['08/19/20', '08/18/20', '08/17/20'],
                   'Tags': [270201, 284912, 123456789101]})

Currently my code is:

`df['Tags'].replace(to_replace='[^0-9]+', value='', inplace = True, regex = True) 

df['Tags'] = df['Tags'].astype(str).str.zfill(6)`

That just filters out the garbage that sometimes comes into the column. I am not sure where to start, I need something that if something in 'Tags' is longer than 6 characters, it splits Tags and duplicates the rest of the row.

  • could you please provide sample data ? Thanks – AMH Aug 19 '20 at 12:51
  • hyperlinked above, or do you need it written out? Sorry new here. – Matt Topic Aug 19 '20 at 12:59
  • written out is easier for me to try on my machine, something I can copy/paste. By the way: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – AMH Aug 19 '20 at 13:09

2 Answers2

0
df = pd.DataFrame({'user': ['Ticket ID', 'Closed Time', 'Tags', 'ADDED'], 
                   'income': [1, 2, 3, 4],
                   'Closed Time': ["08/19/20", "08/18/20", "08/17/20", "01/01/01"],
                   'Tags': [270201, 284912, 123456789101, 987456321456]})

# seperate rows with long tags
df["Tags"] = df["Tags"].astype(int)
df_short_tags = df.loc[df.Tags<10**6].copy()
df_long_tags = df.loc[df.Tags>=10**6].copy()

# split tags column
df_long_tags["Tags_right"] = df_long_tags.Tags.astype(str).str[6:]
df_long_tags["Tags_left"] = df_long_tags.Tags.astype(str).str[:6]

# rearange dataframe
df_tags_split = df_long_tags.melt(id_vars=["user", "income", "Closed Time"], 
                                  value_vars=["Tags_right","Tags_left"], 
                                  value_name="Tags").drop(columns="variable")

# add to rows with valid tags
df_result = pd.concat([df_short_tags, df_tags_split]).reset_index(drop=True)
     user  income Closed Time    Tags
0    Ticket ID       1    08/19/20  270201
1  Closed Time       2    08/18/20  284912
2         Tags       3    08/17/20  789101
3        ADDED       4    01/01/01  321456
4         Tags       3    08/17/20  123456
5        ADDED       4    01/01/01  987456
AMH
  • 502
  • 2
  • 10
  • Thanks AMH, I get a type error, so I am guessing these is something wrong with my data:TypeError: '<' not supported between instances of 'str' and 'int' – Matt Topic Aug 19 '20 at 14:33
  • It probably means that your "Tags" column is of type `string`. I edited my answer to make sure it is `int` – AMH Aug 19 '20 at 14:35
  • Weird, now this: OverflowError: Python int too large to convert to C long – Matt Topic Aug 19 '20 at 14:40
  • Just after loading your data, enter `df.info()`. what Dtype do you have for "Tags" ? – AMH Aug 19 '20 at 14:42
  • I believe it is an object? – Matt Topic Aug 19 '20 at 17:02
  • Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticket Id 136 non-null int64 1 Source 136 non-null object 2 Type 136 non-null object 3 Agent 136 non-null object 4 Closed Time 136 non-null object 5 Tags 134 non-null object dtypes: int64(1), object(5) memory usage: 3.8+ KB – Matt Topic Aug 19 '20 at 17:03
  • maybe try `.astype(float)` instead – AMH Aug 20 '20 at 06:59
0

Here is the function I wrote

    import pandas as pd

    def separate_row(df, column_name):
        ls = []
        # convert the column to str 
        for row_dict in df.astype({column_name: str}).to_dict('records'):
            if len(row_dict[column_name]) == 6:
                ls.append(row_dict)
                continue
            for i in range(len(row_dict[column_name])//6):
                row = row_dict.copy()
                row[column_name] = row_dict[column_name][i*6:(i+1)*6]
                ls.append(row)
        return pd.DataFrame(ls).astype({column_name: df[column_name].dtype})

Here is the test result

    >>> df = pd.DataFrame({'user': ['Ticket ID', 'Closed Time', 'Tags'],
    ...                    'income': [1, 2, 3, ],
    ...                    'Closed Time': ['08/19/20', '08/18/20', '08/17/20'],
    ...                    'Tags': [270201, 284912, 123456789101]})
    >>> separate_row(df, "Tags")
              user  income Closed Time    Tags
    0    Ticket ID       1    08/19/20  270201
    1  Closed Time       2    08/18/20  284912
    2         Tags       3    08/17/20  123456
    3         Tags       3    08/17/20  789101
    >>> df = pd.DataFrame({'user': ['Ticket ID', 'Closed Time', 'Tags'],
    ...                    'income': [1, 2, 3, ],
    ...                    'Closed Time': ['08/19/20', '08/18/20', '08/17/20'],
    ...                    'Tags': [270201, 284912, 123456789101123456]})
    >>> separate_row(df, "Tags")
              user  income Closed Time    Tags
    0    Ticket ID       1    08/19/20  270201
    1  Closed Time       2    08/18/20  284912
    2         Tags       3    08/17/20  123456
    3         Tags       3    08/17/20  789101
    4         Tags       3    08/17/20  123456
Zhiwei
  • 179
  • 1
  • 7