0

I have a dataframe with address data, but some data is in the wrong column and they need to get shifted so the first field is always the first row of the address.

A somewhat simplified version of the dataframe looks like this:

import pandas as pd

addresses = {'add1': ['1', '54', '12-18', ' '], 'add2': ['moore street', 'fountain road', 'st margaret house', '15']}

df = pd.DataFrame(addresses)

To solve this, I created another column (a1,a2) that concatenates the values from add1 and add2 with a column, and a few other lines to strip out extraneous commas and double commas, with the aim of then using this to split the data out again using str.split() and the commas as delimiters.

df['a1,a2'] = df['add1'].astype(str) + ',' + df['add2'].astype(str) + ','

df['a1,a2'] = df['a1,a2'].str.lstrip(',')

df.replace(',,', ',', regex=True, inplace=True)

df['a1,a2'] = df['a1,a2'].str.rstrip(',')
df['a1,a2'] = df['a1,a2'].str.rstrip('"')
df['a1,a2'] = df['a1,a2'].str.lstrip('"')

So far so good. When I export this to a csv file and then split out the columns in Excel I get the right result.

So then I add row to split them out in the dataframe, into new columns

df['a1,a2'].str.split(',', expand=True)

This runs fine, and I get no errors but the resultant file still has the new column as one, not split across new columns.

When I open the csv output in notepad I can see that the new column 'a1,a2' is enclosed in quote marks ("a1,a2") and the same for the data rows

I assume that when the columns are concatenated it makes a single list (?) and so ignores the delimiters within it, but I cannot figure out how to get it to be a normal string so it will split.

Catríona
  • 23
  • 5
  • 1) `df['a1,a2'].str.split(', ')` is split by `, ` (tailing space), but your `a1,a2` column doesn't have `, `(comma space). 2) `df['a1,a2'].str.split(', ')` is not inplace, you need assign the result to new column. – Ynjxsjmh Jun 30 '22 at 12:19
  • Sorry, that's just a typo when I wrote it out here. There isn't a trailing space in the proper version. As I understand it, the expand=True ought to pull them out to new columns? – Catríona Jun 30 '22 at 12:28

1 Answers1

1

OK, thanks to the clue from Ynjxsjmh I managed to work it out (or rather, google the right thing).

The split columns are now a new dataframe and I renamed the source column so that it was easier to reference

expanded = df.concat.str.split(',', expand=True)

then renamed the columns

expanded = expanded.rename(columns={0: 'add1', 1:'add2'})

and added them back into the source dataframe, then dropped the original address columns and exported to csv to check the data.

A bit more info is here

Catríona
  • 23
  • 5