16

I frequently deal with data which is poorly formatted (I.e. number fields are not consistent etc)

There may be other ways, which I am not aware of but the way I format a single column in a dataframe is by using a function and mapping the column to that function.

format = df.column_name.map(format_number)

Question: 1 - what if I have a dataframe with 50 columns, and want to apply that formatting to multiple columns, etc column 1, 3, 5, 7, 9,

Can you go:

format = df.1,3,5,9.map(format_number)

.. This way I could format all my number columns in one line?

falsetru
  • 357,413
  • 63
  • 732
  • 636
yoshiserry
  • 20,175
  • 35
  • 77
  • 104

2 Answers2

20

You can do df[['Col1', 'Col2', 'Col3']].applymap(format_number). Note, though that this will return new columns; it won't modify the existing DataFrame. If you want to put the values back in the original, you'll have to do df[['Col1', 'Col2', 'Col3']] = df[['Col1', 'Col2', 'Col3']].applymap(format_number).

BrenBarn
  • 242,874
  • 37
  • 412
  • 384
  • can you refer to the columns by number instead of name as well when doing that? Is there also any way to programatically create that string (which would change depending on the number of columns you had) and apply the format_number function? I.e. the above would work fine if I knew exactly how many columns were in the sheet every time, but If I didn't know the number of columns, and wanted to apply the same function to every column, is there a better way of doing it? – yoshiserry Mar 03 '14 at 02:15
  • 1
    @yoshiserry: If you just want to apply it to all the columns, just do `df.applymap(format_number)`. – BrenBarn Mar 03 '14 at 02:25
3

You could use apply like this:

df.apply(lambda row: format_number(row), axis=1)

You would need to specify the columns though in your format_number function:

def format_number(row):
    row['Col1'] = doSomething(row['Col1']
    row['Col2'] = doSomething(row['Col2'])
    row['Col3'] = doSomething(row['Col3'])

This is not as elegant as @BrenBarn's answer but it has an advantage that the dataframe is modified in place so you don't need to assign the columns back again

EdChum
  • 376,765
  • 198
  • 813
  • 562
  • could you explain what does it mean that it is modified in place, I'm not sure how that differs from the above solution, given I haven't grasped what lamda does? – yoshiserry Mar 11 '14 at 12:15
  • @yoshiserry ignoring my code example, if you perform `apply` to a dataframe then the dataframe itself is modified by any changes in your function so you would not need to assign to the column, you may still need to depending on what your function is doing. The point being that you just need to call `df.apply` and not need to say do `df[['col1','col2','col3']]=df.apply(lambda row: format_number(row), axis=1))`, in my code the assignment is done by the `format_number` function so I guess the assignment is implicit rather than explicit like BrenBarn's answer – EdChum Mar 11 '14 at 12:22
  • great, and so what exactly is the function of Lamda? I still haven't grasped what lamda does? – yoshiserry Mar 11 '14 at 12:24
  • @yoshiserry there are lots of articles online such as http://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/, http://stackoverflow.com/questions/890128/python-lambda-why and http://www.secnetix.de/olli/Python/lambda_functions.hawk. Essentially it allows you to define an anonymous function so you don't have to define a special function or expression. In this case it captures the row and passes this as an arguement to format_number, it could easily have been `df.apply(lambda x: x **2)` to square the value – EdChum Mar 11 '14 at 13:08