Deriving multiple columns from single column data in python 3 pandas

Question

Source Data:

20  7369    CLERK
30  7499    SALESMAN
30  7521    SALESMAN
20  7566    MANAGER
30  7654    SALESMAN
30  7698    MANAGER
10  7782    MANAGER
20  7788    ANALYST
10  7839    PRESIDENT
30  7844    SALESMAN
20  7876    CLERK
30  7900    CLERK
20  7902    ANALYST

Requirement: 012345678901234567890123456789

Hi All,

I am reading this .dat file data into python pandas successfully. Left to right length of the data in a row is 30 (012345678901234567890123456789) My requirement is, I need to derive 3 columns

From left to right: 1 to 4 (length 4) spaces as DEPTNO 
From left to right: 5 to 13 (length 9) spaces as EMPNO 
From left to right: 14 to 30 (length 9) spaces as EMPNO

I tried this code:

import pandas as pd    
with open('Emp.dat','r') as f:
    next(f) # skip first row
    df = pd.DataFrame(l.rstrip().split() for l in f)

Required Output:

DEPTNO  EMPNO   JOB
20      7369    CLERK
30      7499    SALESMAN
30      7521    SALESMAN
20      7566    MANAGER
30      7654    SALESMAN
30      7698    MANAGER
10      7782    MANAGER
20      7788    ANALYST
10      7839    PRESIDENT
30      7844    SALESMAN
20      7876    CLERK
30      7900    CLERK
20      7902    ANALYST

score 0 · Accepted Answer · answered Jun 28 '19 at 07:54

Maybe use a columns argument:

import pandas as pd    
with open('Emp.dat','r') as f:
    next(f) # skip first row
    df = pd.DataFrame((l.rstrip().split() for l in f), columns=['DEPTNO', 'EMPNO', 'JOB'])

Output:

   DEPTNO EMPNO        JOB
0      20  7369      CLERK
1      30  7499   SALESMAN
2      30  7521   SALESMAN
3      20  7566    MANAGER
4      30  7654   SALESMAN
5      30  7698    MANAGER
6      10  7782    MANAGER
7      20  7788    ANALYST
8      10  7839  PRESIDENT
9      30  7844   SALESMAN
10     20  7876      CLERK
11     30  7900      CLERK
12     20  7902    ANALYST

score 0 · Answer 2 · answered Jun 28 '19 at 08:03

There are two ways here.

Use df = pd.read_csv('emp.dat', sep=r'\s+) to split each line on any number of white space characters (more details on that in How to make separator in pandas read_csv more flexible wrt whitespace?)
Use Fixed Width Fields df = pd.read_fwf(io.StringIO(t), width=[4,9,9])

In both ways the first line will be used as a header line. Use pd.read...(..., header=None, skiprows=[0]) to fully ignore it

Deriving multiple columns from single column data in python 3 pandas

2 Answers2