How to split in chunks a pandas dataframe by it's strings?

Question

I have a pandas dataframe which I generated by appending a sequence of lists, composed mainly by strings which have a separator ("'\n'") like this:

   content

0   American Regent/Luitpold (Reverified 10/26/2016)\nCompany Contact Information:\n800-645-1706\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\n2 mL single-dose vial, package of 10 (NDC 00517-2502-10) Available for NDC 00517-2502-10. Demand increase for the drug
1   Amphastar Pharmaceuticals, Inc./IMS (Reverified 08/18/2016)\nCompany Contact Information:\n800-423-4136\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\nCalcium Chloride Inj. USP, 10%, 10mL Luer-Jet Prefilled Syringe, (NDC 0548-3304-00), new (NDC 76329-3304-1) Product available Demand increase for the drug\nHospira, Inc. (Reverified 10/21/2016)
2   American Regent/Luitpold (Reverified 10/26/2016)\nCompany Contact Information:\n800-645-1706\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\n10%, 50 mL vial; Calcium (0.465 mEq/mL), Preservative Free (NDC 0517-3950-25) Unavailable for NDC 00517-3950-25. No product available for release. No plan to manufacture. American Regent is currently not releasing Calcium Gluconate 50 mL vial (NDC 00517-3950-25). Other\n10%, 100 mL vial; Calcium (0.465 mEq/mL), Preservative Free (NDC 0517-3900-25) Unavailable for NDC 00517-3900-25. American Regent is currently not releasing Calcium Gluconate 100 mL vial (NDC 0517-3900-25). Other\nFresenius Kabi USA, LLC (Revised 11/01/2016)
 .......
n   Apotex Corp. (Revised 05/16/2016)\nCompany Contact Information:\n800-706-5575\n\nPresentation\n1gm; (25 Vials) (NDC 60505-0749-5)\n1gm; (25 Vials)(NDC 60505-6093-5)\n10 gm; (10 Vials) (NDC 60505-0769-0)\n10 gm; (10 Vials) (NDC 60505-6094-0)\nNote:\nAvailable\nB. Braun Medical Inc. (Revised 05/16/2016)\n\n\nBaxter Healthcare (Revised 05/16/2016)\n\n\nFresenius Kabi USA, LLC (Revised 05/16/2016)\n\n\nHospira, Inc. (Revised 05/16/2016)\n\n\nSagent Pharmaceuticals (Revised 05/16/2016)\n\n\nSandoz (Revised 05/16/2016)\n\n\nWest-Ward Pharmaceuticals (Revised 05/16/2016)\n\n\nWG Critical Care (Revised 05/16/2016)
n-1 Apotex Corp. (Reverified 10/26/2016)\nCompany Contact Information:\n800-706-5575\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\nCefepime for Injection, USP 1 gm (10 Vials) (NDC 60505-6030-4) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for Injection, USP 2 gm (10 Vials)(NDC 60505-6031-4) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 1 gm (10 Vials) (NDC 60605-0834-04) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 2 gm (10 Vials) (NDC 60505-0681-4) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 1 gm (1 Vial) (NDC 60505-0834-00) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 2 gm (10 Vials) (NDC 60505-0681-0) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nB. Braun Medical Inc. (New 07/22/2015)\n\n\nBaxter Healthcare (Reverified 10/25/2016)\n\n\nFresenius Kabi USA, LLC (Revised 11/01/2016)\n\n\nHospira, Inc. (Reverified 10/21/2016)\n\n\nSagent Pharmaceuticals (Revised 08/29/2016)\n\n\nWG Critical Care (Revised 06/08/2016)

How can I separate by its new line \n the content of the dataframe in more columns:

   col1              col2        col3        col4
0  Shire US Inc. (Reverified 07/01/2016)   and so  on.... 
1  Hospira, Inc. (Reverified 10/21/2016)   and so  on....  
2  Mission Pharmacal (Reverified 01/21/2015)   and so  on....  
....
n  Mission Pharmacal (Reverified 01/21/2015)   and so  on....

I tried to:

df['col'] = df['content'].str.split('\n', expand = true)

And apparently, I am getting a wrong number of items, passed 45, placement implies 1. Also since I am doing:

df = pd.DataFrame(lis, columns = ['content'])

I am unable to do use sep.

When you loaded the pandas dataframe, you have an option to define a delimiter, which in this case you can use '\n'. Let me know if this helps. — BernardL, Nov 04 '16 at 02:32
I all ready tried to do sep = '\n', it did not worked! @BernardL — john doe, Nov 04 '16 at 02:33

score 1 · Accepted Answer · edited May 23 '17 at 10:32

1

Similar question here

df = pd.DataFrame(['The quick brown\n fox jumps \nover the \n lazy dog',
'The quick brown\n fox jumps \nover the \n lazy dog',
'The quick brown\n fox jumps \nover the \n lazy dog','The quick brown\n fox jumps \nover the \n lazy dog'], columns = ['data'])

foo = lambda x: pd.Series([i for i in reversed(x.split('\n'))])
rev = df['data'].apply(foo)

EDIT After talking through here is updated code that will load multiple files into a single dataframe:

allFiles_df = None
for it, currFile in enumerate(files):

    df = pd.read_csv(currFile, sep = '\n', header = None)
    df.columns = ['data']

    splitFunc = lambda x: pd.Series([i for i in reversed(x.split('\\n'))])

    df = df['data'].apply(splitFunc)
    df = df.stack().to_frame().reset_index().drop(['level_1'],axis = 1)
    df = df[df[0].str.len() >2]
    df['fileNo'] = it

    allFiles_df = pd.concat([allFiles_df,rev])

allFiles_df.columns = ['rowNo','text','fileNo']

key things to note: the '\n' is text in your raw data, so it gets read into python as '\\n'. The sep keyword in read_csv does not allow for separation on multiple characters, that's why you're having issues with that.

This will output the file and row number each string was found in. It assumes the files variable contains a list of file names with paths.

edited May 23 '17 at 10:32

Community

1
1

answered Nov 04 '16 at 02:35

flyingmeatball

7,457
7
44
62

thanks!, it actually is a nested list. Not a string! – john doe Nov 04 '16 at 02:36
1

The example above looks like you have either a series or a 1d dataframe. Please post raw data that reflects yours, similar to how I have created a df for users to work with. In the future, creating raw data that is easy for users to play with will get you the best/fastest results. – flyingmeatball Nov 04 '16 at 02:43
I tried to : `normalize = lambda x: pd.Series([i for i in reversed(x.split('\n'))]) newdf = df['0','1'].apply(normalize) print(newdf)` and got `KeyError ('0', '1')`. – john doe Nov 04 '16 at 02:44
I tried to df[['col1', 'col2', 'col3', 'col4']] and the same error repeats. – john doe Nov 04 '16 at 02:48
1

Edit your data up above so we can seek what you're working with. Do like I did and create a df that mirrors your raw data. **EDIT** we don't need to see your entire data, just like 3 rows of fake data that is in a similar format to yours and will let you apply the results your your full dataset. – flyingmeatball Nov 04 '16 at 02:49
That still isn't formatted correctly for users to work with. Notice the example I put, you can copy - paste into a python script and work directly with it (df = pd.DataFrame(...)). Yours is just text formatted for a window. If it's in a list, it should be myList = [Shire US Inc....,Hospira...,etc]. etc. If it's in a series, pd.Series(['blablabla','blabla']) – flyingmeatball Nov 04 '16 at 03:03
I updated the data, the problem was that the dataframe didn't let me to show the full content – john doe Nov 04 '16 at 03:08
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/127340/discussion-between-flyingmeatball-and-john-doe). – flyingmeatball Nov 04 '16 at 03:10
Thanks!, I will check the solution – john doe Nov 04 '16 at 17:11
1

Please give it a green check if it gets you on your way to answering the question posed above. – flyingmeatball Nov 04 '16 at 18:10

How to split in chunks a pandas dataframe by it's strings?

1 Answers1