1

This is how my data looks,

emp_id  col1  col2  col3
1234,abc|de,2020|2011,89
5639,ma,2010|2019,90

This is how data need to be changed and saved into the file

 emp_id  col1  col2  col3
 1234    abc   2020  89
 1234    abc   2011  89
 1234    de    2020  89
 1234    de    2011  89
 5639    ma    2010  90
 5639    ma    2019  90

Is there an easy way to do this in python?

Jennie
  • 225
  • 2
  • 13
  • use string split https://docs.python.org/3/library/stdtypes.html#str.split and https://docs.python.org/3/library/stdtypes.html#str.join join. If you want data in a fixed width format.. use {:n} with an f string https://stackoverflow.com/questions/14776788/python-how-can-i-pad-a-string-with-spaces-from-the-right-and-left https://realpython.com/python-f-strings/ and finally https://www.geeksforgeeks.org/reading-writing-text-files-python/ – JGFMK Sep 07 '20 at 17:32
  • I need to add the splitted multivalues to the new rows – Jennie Sep 07 '20 at 17:33
  • right so as you read each row split first by the commas.. then with 2nd/3rd cols.. split again with | and nest inside of loops. So as you read from one file - you manipulate the line you read, format it and output it to a second file. All the links will give you the building blocks to achieve what you desire. – JGFMK Sep 07 '20 at 17:35
  • `df.transform(lambda x: x.str.split('|') if x.name in ['col1', 'col2'] else x).explode('col1').explode('col2')` – Onyambu Sep 07 '20 at 18:44
  • Does this answer your question? [Replace the pattern in pandas datframe](https://stackoverflow.com/questions/63782954/replace-the-pattern-in-pandas-datframe) – JGFMK Sep 07 '20 at 19:17
  • Actually that question is to replace patterns and this one is to split the multivalued columns :-( – Jennie Sep 07 '20 at 19:37
  • @JGFMK Am I causing confusion – Jennie Sep 07 '20 at 19:38

1 Answers1

0

I don't think I have the easiest way but the following code works on your example:

import pandas as pd
# read your example
df = pd.read_csv(
    io.StringIO(
        r"""emp_id,col1,col2,col3
1234,abc|de,2020|2011,89
5639,ma,2010|2019,90"""
    )
)

# split and expand the column with pipe sign
# expanded_col1 and expanded_col2 are dataframes
# rename the column in order to find them after
expanded_col1 = df.col1.str.split('|', expand=True).rename(
    lambda x: f'col1_{x}', axis='columns'
)
expanded_col2 = df.col2.str.split('|', expand=True).rename(
    lambda x: f'col2_{x}', axis='columns'
)

# create all combinations from values of string split
to_concat = []
for col1, col2 in itertools.product(expanded_col1, expanded_col2):
    to_concat.append(
        pd.concat(
            [
                # put back the initial column name
                expanded_col1.loc[:, [col1]].rename(
                    lambda x: x.split('_')[0], axis='columns'
                ),
                expanded_col2.loc[:, [col2]].rename(
                    lambda x: x.split('_')[0], axis='columns'
                ),
            ],
            axis='columns',
        ).dropna()
    )


result = pd.merge(  # merge combinations and other columns
    df.drop(['col1', 'col2'], axis='columns'),  # drop initial split columns
    pd.concat(to_concat), # concat all combinations
    left_index=True,
    right_index=True,
)

Result:

   emp_id  col3 col1  col2
0    1234    89  abc  2020
0    1234    89  abc  2011
0    1234    89   de  2020
0    1234    89   de  2011
1    5639    90   ma  2010
1    5639    90   ma  2019
ndclt
  • 2,590
  • 2
  • 12
  • 26