7

I have a csv file that has several columns that I first delimit by colon (;). However, ONE column is delimited by a pipe | and I would like to delimit this column and create new columns.

Input:

  Column 1    Column 2      Column 3
     1           2          3|4|5
     6           7          6|7|8
     10          11         12|13|14

Desired Output:

  Column 1   Column 2      ID    Age  Height
     1          2          3      4    5 
     6          7          6      7    8
     10         11         12     13   14

My code so far delimits the first time by ; and then converts to DF (which is my desired end format)

delimit = list(csv.reader(open('test.csv', 'rt'), delimiter=';'))
df = pd.DataFrame(delimit)
DSM
  • 342,061
  • 65
  • 592
  • 494
user3682157
  • 1,625
  • 8
  • 29
  • 55
  • 1
    you can parse the last column and [split it](http://stackoverflow.com/questions/14745022/pandas-dataframe-how-do-i-split-a-column-into-two) – R Nar Nov 09 '15 at 18:25

4 Answers4

3

You didn't show exactly what the data looks like (you say it's delimited by semicolons, but your examples don't have any), but if it looks like

Column 1;Column 2;Column 3
1;2;3|4|5
6;7;6|7|8
10;11;12|13|14

You could do something like

>>> df = pd.read_csv("test.csv", sep="[;|]", engine='python', skiprows=1, 
                     names=["Column 1", "Column 2", "ID", "Age", "Height"])
>>> df
   Column 1  Column 2  ID  Age  Height
0         1         2   3    4       5
1         6         7   6    7       8
2        10        11  12   13      14

This works by using a regex separator meaning "either ; or |" and forcing the column names manually.

Alternatively, you could do it in a few steps:

>>> df = pd.read_csv("test.csv", sep=";")
>>> df
   Column 1  Column 2  Column 3
0         1         2     3|4|5
1         6         7     6|7|8
2        10        11  12|13|14
>>> c3 = df.pop("Column 3").str.split("|", expand=True)
>>> c3.columns = ["ID", "Age", "Height"]
>>> df.join(c3)
   Column 1  Column 2  ID Age Height
0         1         2   3   4      5
1         6         7   6   7      8
2        10        11  12  13     14
DSM
  • 342,061
  • 65
  • 592
  • 494
  • I get the following error when trying to run the second half of your code: TypeError: split() got an unexpected keyword argument 'expand' – user3682157 Nov 17 '15 at 18:50
  • 1
    @user3682157: you're probably using an older version of pandas. – DSM Nov 17 '15 at 19:31
0
delimit = list(csv.reader(open('test.csv', 'rt'), delimiter=';'))

for row in delimit:
    piped = row.pop()
    row.extend(piped.split('|'))

df = pd.DataFrame(delimit)

delimit ends up looking like:

[
    ['1', '2', '3', '4', '5'],
    ['6', '7', '6', '7', '8'],
    ['10', '11', '12', '13', '14'],
]
Ethan Furman
  • 63,992
  • 20
  • 159
  • 237
0

It is actually much faster to use the csv lib and str.replace:

import csv
with open("test.txt") as f:
    next(f)
    # itertools.imap python2
    df = pd.DataFrame.from_records(csv.reader(map(lambda x: x.rstrip().replace("|", ";"), f), delimiter=";"),
                                   columns=["Column 1", "Column 2", "ID", "Age", "Height"]).astype(int)

Some timings:

In [35]: %%timeit
pd.read_csv("test.txt", sep="[;|]", engine='python', skiprows=1,
                     names=["Column 1", "Column 2", "ID", "Age", "Height"])
   ....: 
100 loops, best of 3: 14.7 ms per loop

In [36]: %%timeit                                                             
with open("test.txt") as f:
    next(f)
    df = pd.DataFrame.from_records(csv.reader(map(lambda x: x.rstrip().replace("|", ";"), f),delimiter=";"),
                               columns=["Column 1", "Column 2", "ID", "Age", "Height"]).astype(int)
   ....: 
100 loops, best of 3: 6.05 ms per loop

You could just str.split:

with open("test.txt") as f:
    next(f)
    df = pd.DataFrame.from_records(map(lambda x: x.rstrip().replace("|", ";").split(";"), f),
                                   columns=["Column 1", "Column 2", "ID", "Age", "Height"])
Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
0

Figured out a solution for myself:

df = pd.DataFrame(delimit)
s = df['Column 3'].apply(lambda x: pd.Series(x.split('|')))
frame = pd.DataFrame(s)
frame.rename(columns={0: 'ID',1:'Height',2:'Age'}, inplace=True)
result = pd.concat([df, frame], axis=1)
user3682157
  • 1,625
  • 8
  • 29
  • 55