
I'm doing an analysis using Python to see how long we are keeping conversations in our social media channels by counting the number of interactions and reading the messages.

I'm thinking the approach would be to make the first table look like the second table. Steps to get there:

  1. Parse out the @username into its own field. 99% of the time, our responses start with @username
  2. Count the number of times the @username shows up - this indicates the number of messages we send the user
  3. Turn the inbound and outbound messages from long to wide format. I'm not certain how many fields I'd have to create, but as long as I know the technique, I can worry about the number of fields later.

Right now the timestamp isn't that important to me, but it could be in the future. Regardless, I would use the same technique as for the inbound and outbound messages.

Before table

    Inbound Message      Outbound message     Inbound Time     Outbound Time     Account
    Hello, how are you   @userA I'm good!     mm/dd/yy hh:mm   mm/dd/yy hh:mm    FB
    Where is the cat?    @userB what cat?     mm/dd/yy hh:mm   mm/dd/yy hh:mm    Twitter
    What is a pie        @user3 it is a food  mm/dd/yy hh:mm   mm/dd/yy hh:mm    Twitter
    The black cat        @userB I don't know  mm/dd/yy hh:mm   mm/dd/yy hh:mm    Twitter

After table

    User    Messages  Account   Inbound 1           Outbound 1           Inbound 2      Outbound 2            Inbound 3  Outbound 3
    userA   1         FB        Hello, how are you  @userA I'm good!     null           null                  null       null
    userB   2         Twitter   Where is the cat?   @userB what cat?     The black cat  @userB I don't know   null       null
    user3   1         Twitter   What is a pie       @user3 it is a food  null           null                  null       null
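The three steps above can be sketched in pandas. This is only a sketch: the DataFrame below is a hypothetical reconstruction of the "before" table, and it assumes (as stated in step 1) that every outbound message starts with @username.

```python
import pandas as pd

# hypothetical reconstruction of the "before" table
df = pd.DataFrame({
    'Inbound Message': ['Hello, how are you', 'Where is the cat?',
                        'What is a pie', 'The black cat'],
    'Outbound message': ["@userA I'm good!", '@userB what cat?',
                         '@user3 it is a food', "@userB I don't know"],
    'Account': ['FB', 'Twitter', 'Twitter', 'Twitter'],
})

# Step 1: parse the @username into its own field
# (take the first whitespace-separated token and drop the leading "@")
df['User'] = df['Outbound message'].str.split(' ', n=1).str[0].str.lstrip('@')

# Step 2: count how many messages were sent to each user
counts = df['User'].value_counts()
```

Step 3, the long-to-wide reshape, is the harder part and is what the answers below tackle.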

Anjan G
  • Just out of curiosity - is there any reason you want "wide" data - it's never normally very easy to manipulate even if you limit to a fixed number of columns... (and of course - adding/removing data will mean you have to potentially re-shape your table each time...) – Jon Clements Sep 03 '16 at 19:16
  • Good question. It's mostly because I want it to be easy for me to read. If I see that the most number of messages will be for user2, then I can read the interaction step by step by going left to right with the data. You do make some good points though on how wide format limits how dynamic your tables are. I suppose I can just filter by user2 in long format, sort by timestamp, and read up/down. – Anjan G Sep 03 '16 at 19:29

2 Answers


Your transformation process should include several steps; I will give an idea for only one of them.

Concerning user extraction: first apply the re.sub(pattern, repl, string) function from the re module to the rows of the Outbound message column (in a loop). The sub function works together with the compile function, where the regular expression is specified:

import re
# x - your string
# the pattern matches a whitespace between a digit and a word character
y = re.compile(r"([0-9])(\s)(\w)")
# keep the digit (group 1) and word character (group 3),
# replacing the whitespace between them with a "_" symbol
y.sub(r"\1_\3", x)

This replaces the whitespace with a "_". In the next step you can split the cells on the "_" character:

import re
# x - your data
re.split('_', x) 

Further examples for both functions can be found in the official re module documentation.
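Putting the two steps together on one of the sample outbound messages from the question (note this pattern only works when the username ends in a digit, which the EDIT below addresses):

```python
import re

x = "@user3 it is a food"  # sample outbound message; username ends in a digit

# mark the first digit-whitespace-letter boundary with an underscore
y = re.compile(r"([0-9])(\s)(\w)")
marked = y.sub(r"\1_\3", x)        # "@user3_it is a food"

# split on the underscore to separate the username from the message
user, message = marked.split('_', 1)
```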

EDIT:

Since your user id doesn't always end with a number, you need different logic to extract it from the Outbound message column. You can extract the first word of the string either with x.split(' ', 1)[0], where x is your string, or with the regex ^\S*.
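A quick demonstration of both extraction options, plus the @\w+ pattern from the comments for when the username is not at the start (sample strings taken from the tables above):

```python
import re

x = "@userB what cat?"

# first word via str.split
first = x.split(' ', 1)[0]                 # "@userB"

# the same thing with a regex anchored at the start of the string
first_re = re.match(r"^\S*", x).group()    # "@userB"

# or find the @username anywhere in the string
anywhere = re.search(r"@\w+", "thanks @userB for asking").group()
```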

vlad.rad
  • Thanks! I noticed that regular expressions show up quite a bit when researching this answer. Question: does your method rely on the last character in the user field being a number? If so, I'll have to edit my question since most of the time, that won't be the case. It can be any twitter or facebook username. Is there a way to modify the pattern to detect the first whitespace after searching for the "@" field? – Anjan G Sep 03 '16 at 22:38
  • You know, I have found another logic for this. You can just use this: x.split(' ', 1)[0], where x - your string, to get the first word of the string. With regex you can do the same with ^\S* – vlad.rad Sep 04 '16 at 07:04
  • And if your user id is not always at the beginning of the string, then use this regex: @\w+ – vlad.rad Sep 04 '16 at 09:35

Note
You need to group by user and df.Account because it is possible to have messages from different accounts for the same user.

import pandas as pd

# helper function to flatten multiindex objects
def flatten_multiindex(midx, sep=' '):
    n = midx.nlevels
    tups = zip(*[midx.get_level_values(i).astype(str) for i in range(n)])
    return pd.Index([sep.join(tup) for tup in tups])

in_out = ['Inbound Message', 'Outbound message']

# this gets a fresh ordering for each group
handler = lambda df: df.reset_index(drop=True)

# Use regular expression to extract user
user = df['Outbound message'].str.extract(r'(?P<User>@\w+)', expand=False)

df1 = df.groupby([user, df.Account])[in_out].apply(handler) \
        .unstack().sort_index(axis=1, level=1)
df1.columns = flatten_multiindex(df1.columns)

# I separated getting group sizes from long to wide pivot
messages = df.groupby([user, df.Account]).size().to_frame('Messages')

pd.concat([messages, df1], axis=1)

(screenshot: the resulting wide table, indexed by User and Account)

pd.concat([messages, df1], axis=1).reset_index()

(screenshot: the same table with User and Account reset to ordinary columns)


Breaking down the helper function

# helper function to flatten multiindex objects
def flatten_multiindex(midx, sep=' '):

    # get the number of levels in the index
    n = midx.nlevels

    # for each level in the index, get the values and
    # convert it to strings so I can later ' '.join on it
    #
    # zip forms the tuples so I can pass each result to ' '.join
    tups = zip(*[midx.get_level_values(i).astype(str) for i in range(n)])
    # do the ' '.join and return as an index object
    return pd.Index([sep.join(tup) for tup in tups])
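A minimal standalone run of the helper on a small two-level column index (the level values here are illustrative, chosen to mirror the answer's columns):

```python
import pandas as pd

def flatten_multiindex(midx, sep=' '):
    n = midx.nlevels
    tups = zip(*[midx.get_level_values(i).astype(str) for i in range(n)])
    return pd.Index([sep.join(tup) for tup in tups])

midx = pd.MultiIndex.from_product(
    [['Inbound Message', 'Outbound message'], [0, 1]])
flat = flatten_multiindex(midx)
# flat: Index(['Inbound Message 0', 'Inbound Message 1',
#              'Outbound message 0', 'Outbound message 1'])
```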
piRSquared
  • Thank you! This works but I still had a question - can you explain what flatten_multiindex is doing? I can visually see what it does but how does the function work? Thanks again – Anjan G Sep 08 '16 at 03:32