9

I see it's possible to append using the series namespace (https://stackoverflow.com/a/70599059/5363883). What I'm wondering is if there is a similar method for appending or concatenating DataFrames.

In pandas, this was historically done with df1.append(df2). However, that method was deprecated in pandas 1.4 (and removed in 2.0) in favor of pd.concat([df1, df2]).

df1

a b c
1 2 3

df2

a b c
4 5 6

res

a b c
1 2 3
4 5 6
Cornelius Roemer
cnpryer
  • Closest thing I could find that doesn't seem documented in the cookbook would be [.extend()](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.extend.html?). – cnpryer Mar 29 '22 at 00:14

2 Answers

25

There are different append strategies depending on your needs.

import polars as pl

df1 = pl.DataFrame({"a": [1], "b": [2], "c": [3]})
df2 = pl.DataFrame({"a": [4], "b": [5], "c": [6]})


# new memory slab
new_df = pl.concat([df1, df2], rechunk=True)

# append free (no memory copy)
new_df = df1.vstack(df2)

# try to append in place
df1.extend(df2)

To understand the differences, it helps to know that Polars memory is immutable if and only if it has any copy.

Copies in Polars are free, because copying only increments the reference count of the backing memory buffer instead of copying the data itself.

However, if a memory buffer has no copies yet, i.e. the refcount == 1, we can mutate Polars memory.

With this background, there are the following ways to append data:

  • concat -> Concatenates all given DataFrames. This is sort of a linked list of DataFrames. If you pass rechunk=True, all memory is reallocated into contiguous chunks.
  • vstack -> Adds the data from other to the DataFrame by incrementing a refcount. This is super cheap. It is recommended to call rechunk after many vstacks, or simply use pl.concat.
  • extend -> This operation copies data. It tries to copy the data from other into the DataFrame. If, however, the refcount of the DataFrame is larger than 1, a new buffer of memory is allocated to hold both DataFrames.
ritchie46
  • Wow. Not sure how I missed `pl.concat`. Sounds like `rechunk=True` is what I was looking for too. – cnpryer Mar 29 '22 at 14:58
  • If `rechunk=True`, does `concat` copy the data to a newly allocated memory region? If so, how does that differ from `extend`? – ywat Feb 13 '23 at 21:42
  • @ritchie46 After vstacking two DataFrames, is it necessary to call rechunk before writing the result to disk with write_ipc? – Ethan Feb 18 '23 at 12:58
2

It looks like .extend() mutates df1 in place, growing its memory to hold df2's data.

import polars as pl

df1 = pl.DataFrame({"a": [1], "b": [2], "c": [3]})
df2 = pl.DataFrame({"a": [4], "b": [5], "c": [6]})
df1.extend(df2)

┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘

This makes sense, but if I wanted to create a totally distinct df3 (distinct in memory too), I'm guessing it would be:

import polars as pl

df1 = pl.DataFrame({"a": [1], "b": [2], "c": [3]})
df2 = pl.DataFrame({"a": [4], "b": [5], "c": [6]})

df3 = pl.from_records(df1.to_numpy(), columns=["a", "b", "c"])
df3.extend(df2)

┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘

Any feedback on less verbose methods would be appreciated.

cnpryer
  • Playing around with this, this seems to create extended `df1` as entirely separate from `df2` memory. Correct me if I'm wrong. – cnpryer Mar 29 '22 at 02:19