How to convert a Series object into a DataFrame using string cleaning

Question

I have a Series of strings in which the trailing characters tell me how the entries relate to each other. For instance, an entry ending in [] is a header, and the entries ending in () that follow it belong to that header.

s = pd.Series(['September[jk]', 'firember hfh(start)','secmber(end)','Last day(hjh)',
              'October[jk]','firober fhfh (start)','thber(marg)','lasber(sth)',
              'December[jk]','anober(start)','secber(start)','Another(hkjl)'])

I can clean the strings easily, but I want to use these trailing characters to build a data frame like this:

0   September   firember hfh
1   September   secmber
2   September  Last day
3    October   firober fhfh
4    October     thber
5    October    lasber
6   December    anober
7   December    secber
8   December   Another

How do you decide that `firober fhfh (start)` becomes only `firober`? Whereas `Last day(hjh)` becomes `Last day` — Julien Marrec, Dec 08 '16 at 23:37
Sorry, Firober only. Those truncated should be with the symbol. Thank you — Enqu T. Job, Dec 08 '16 at 23:47

score 0 · Answer 1 · answered Dec 08 '16 at 23:56

I don't think there's any magic here, so I recommend parsing the list yourself before creating the dataframe:

import re
import pandas as pd

l = ['September[jk]', 'firember hfh(start)','secmber(end)','Last day(hjh)',
              'October[jk]','firober fhfh (start)','thber(marg)','lasber(sth)',
              'December[jk]','anober(start)','secber(start)','Another(hkjl)']

month = None
mylist = []
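for i, el in enumerate(l):
    # A header line such as 'September[jk]': remember the text before '['
    m = re.match(r'(.*?)\[.*?\]', el)
    if m:
        month = m.group(1)
    else:
        # A value line such as 'secmber(end)': keep the text before '('
        m = re.match(r'(.*?)\(.*?\)', el)
        if m:
            mylist.append({'Month': month, 'Value': m.group(1)})
        else:
            print("Cannot find a match for {}".format(el))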

df = pd.DataFrame(mylist)
print(df)

Out:

       Month          Value
0  September   firember hfh
1  September        secmber
2  September       Last day
3    October  firober fhfh 
4    October          thber
5    October         lasber
6   December         anober
7   December         secber
8   December        Another

Side note: I used the `re` library for regex because it can be adapted to many more complex situations, but in your case you could just use built-in string operations, with `in` and `split`:

month = None
mylist = []  # start from an empty list so rows are not appended twice
for i, el in enumerate(l):
    if '[' in el:
        month = el.split('[')[0]
    elif '(' in el:
        mylist.append({'Month': month, 'Value': el.split('(')[0]})
    else:
        print("Cannot find a match for {}".format(el))
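If you would rather stay inside pandas, the same idea (remember the last header, attach it to the rows that follow) can probably be written without an explicit loop. This is only a sketch under the assumption that every header ends in `]` and every value row contains `(`; the names `is_month` and `name` are mine, not from the question. It also strips the whitespace left before the bracket, so `firober fhfh (start)` gives `firober fhfh` with no trailing space.

```python
import pandas as pd

s = pd.Series(['September[jk]', 'firember hfh(start)', 'secmber(end)', 'Last day(hjh)',
               'October[jk]', 'firober fhfh (start)', 'thber(marg)', 'lasber(sth)',
               'December[jk]', 'anober(start)', 'secber(start)', 'Another(hkjl)'])

# Assumption: header rows (months) are exactly the ones ending in ']'.
is_month = s.str.endswith(']')

# Text before the first '[' or '(', dropping any whitespace just before it.
name = s.str.extract(r'^(.*?)\s*[\[(]', expand=False)

# Keep the name only on header rows, then carry it forward to the rows below.
month = name.where(is_month).ffill()

# Drop the header rows themselves and renumber the index from 0.
df = pd.DataFrame({'Month': month[~is_month],
                   'Value': name[~is_month]}).reset_index(drop=True)
print(df)
```

For a dozen rows the loop is just as good; the vectorized form mainly pays off on long Series.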

Thank you, Julien. It is a good way, but in my case it is rejected for using a string pattern on a bytes-like object. Even if I add "b" to the pattern, it is rejected. Huge thanks for your help once again — Enqu T. Job, Dec 09 '16 at 00:27
Have you tried running just my code (where I define the list as strings)? That should work. What version of python are you on? — Julien Marrec, Dec 09 '16 at 00:28
Python 3. The code is absolutely correct and runs properly, but I have no idea why it says 'numpy.int64' is not iterable. When I convert the series to strings, it then says I am using a bytes pattern on a string object — Enqu T. Job, Dec 09 '16 at 00:44
If you print your list, do you see something like `[b'September[jk]', b'firember hfh(start)'`? If you do `type(l[0])` does it return `bytes`? In which case right after `for i, el in enumerate(l)` add `el = el.decode('utf-8')`. See [this question](http://stackoverflow.com/questions/606191/convert-bytes-to-a-python-string) — Julien Marrec, Dec 09 '16 at 00:56
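The decoding step suggested in the last comment might look like the sketch below. The asker's real data is not shown, so the `raw` Series and the helper `to_text` are hypothetical; the only assumption is that some elements are `bytes` encoded as UTF-8. Anything that is not `bytes` (for example a stray `numpy.int64`) is passed through `str()` so that the regex code always receives text.

```python
import pandas as pd

# Hypothetical input: the same kind of data, but stored as bytes.
raw = pd.Series([b'September[jk]', b'firember hfh(start)', b'secmber(end)'])

def to_text(value, encoding='utf-8'):
    """Decode bytes to str; convert any other type with str()."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)

# Normalize every element to str before applying string patterns.
decoded = raw.map(to_text)
print(type(decoded.iloc[0]))
print(decoded.tolist())
```

After this, `decoded.tolist()` can be used in place of `l` in the answer's loop, with ordinary (non-bytes) patterns.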
