
I have a Series/DataFrame like the one below. The non-missing elements are lists with one or more values:

0      NaN
1     [40]
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8      NaN
9     [35]
10     NaN
11     NaN
12    [28]
13     NaN
14     NaN
15     NaN
16     NaN
17     NaN
Name: tags, dtype: object

I would like to fill in the missing values with the latest value, up to five consecutive entries; a forward fill with limit=5 would be most appropriate. However, my use case requires the forward fills to overlap. My expected output would look something like this:

0          NaN
1         [40]
2         [40]
3         [40]
4         [40]
5         [40]
6         [40]
7          NaN
8          NaN
9         [35]
10        [35]
11        [35]
12        [28]
13    [35, 28]
14    [35, 28]
15        [28]
16        [28]
17        [28]
Name: tags, dtype: object

The example above is simplified; the function I'm describing will be part of a much larger pd.groupby operation with many more tags, so Python loops aren't much help. I don't care about the rows that already contain tags; only the rows being filled matter to me. Maybe an approach with pandas cumsum and slicing based on index differences would work here?

Any ideas on how to approach this problem would be incredibly helpful. Thanks in advance!
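For anyone who wants to try this out, the example Series can be reproduced with something like the sketch below (the name `tags` matches the dtype/name line in the output above):

```python
import numpy as np
import pandas as pd

# NaN marks missing rows; non-missing rows hold lists of tags
tags = pd.Series(
    [np.nan, [40]] + [np.nan] * 7 + [[35], np.nan, np.nan, [28]] + [np.nan] * 5,
    name="tags",
)
```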

Kael

2 Answers


You could try:

# replace NaN entries with empty lists
df['tags'] = [[] if na else s for s, na in zip(df['tags'], df['tags'].isna())]

# compute rolling windows
df['res'] = [[l for ls in window for l in ls] for window in df['tags'].rolling(5)]
print(df)

Output

    tags       res
0     []        []
1   [40]      [40]
2     []      [40]
3     []      [40]
4     []      [40]
5     []      [40]
6     []        []
7     []        []
8     []        []
9   [35]      [35]
10    []      [35]
11    []      [35]
12  [28]  [35, 28]
13    []  [35, 28]
14    []      [28]
15    []      [28]
16    []      [28]
17    []        []

As an alternative, you could use chain.from_iterable:

from itertools import chain

# compute rolling windows
df['res'] = [list(chain.from_iterable(window)) for window in df['tags'].rolling(5)]

See this answer for a comparison of several list-flattening methods in pandas.
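Note that this output uses empty lists where the question's expected output has NaN. If you need NaN back, one post-processing sketch (assuming a `res` column like the one above) relies on empty lists being falsy:

```python
import numpy as np
import pandas as pd

# hypothetical result column: empty lists mark unfilled rows
df = pd.DataFrame({"res": [[], [40], [40], []]})

# empty lists are falsy, so they collapse back to NaN
df["res"] = df["res"].apply(lambda x: x if x else np.nan)
```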

Dani Mesejo
  • This approach works well without any issues of any sort. Many thanks! Although it's a bit slower compared to the shift based approach. The functionality that I desire is to be part of a much larger groupby operation. It takes two minutes compared to 30s. However the problem with shift based approach is that it runs into memory issues when the window size is large (even 20). I loop through the required number of shifts and accumulate the tags. Can't understand why this would happen? – Kael Nov 29 '20 at 04:38
  • I had a bug in my code previously now works fine! Your solution is still faster for larger window sizes. – Kael Nov 29 '20 at 06:20
import numpy as np
import pandas as pd

# init the DataFrame
temp = pd.DataFrame({"tags":[
    np.nan, [40], np.nan, np.nan, np.nan, 
    np.nan, np.nan, np.nan, np.nan, [35], 
    np.nan, np.nan, [28], np.nan, np.nan, 
    np.nan, np.nan, np.nan]})

# initialize the result with empty lists for list concatenation
temp['ctags'] = temp['tags'].apply(lambda x: x if isinstance(x, list) else [])

window = 5
for i in range(1, window):
    temp['ctags'] = temp['ctags'] + temp['tags'].shift(i).apply(lambda x: x if isinstance(x, list) else [])

temp['ctags']

Gives output:

0           []
1         [40]
2         [40]
3         [40]
4         [40]
5         [40]
6           []
7           []
8           []
9         [35]
10        [35]
11        [35]
12    [28, 35]
13    [28, 35]
14        [28]
15        [28]
16        [28]
17          []

I was able to come up with this quick solution to my problem, but it isn't as efficient as I would like: increasing the fill limit to, say, 10 makes it even slower.

EDIT: Added a loop for reusability. It accumulates the result in a single column, so it is more memory efficient.
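Since this is meant to run inside a larger groupby, one way to apply the rolling-concatenation idea per group, so fills never leak across group boundaries, could be sketched as follows (the `group` column and the `overlap_ffill` helper are hypothetical names, not from the original post):

```python
from itertools import chain

import numpy as np
import pandas as pd

def overlap_ffill(s, window=5):
    # replace NaN with [] so window contents can be concatenated
    filled = s.apply(lambda x: x if isinstance(x, list) else [])
    # flatten each rolling window of `window` rows into a single list
    return pd.Series(
        [list(chain.from_iterable(w)) for w in filled.rolling(window)],
        index=s.index,
    )

df = pd.DataFrame({
    "group": ["a"] * 6 + ["b"] * 6,
    "tags": [np.nan, [40], np.nan, np.nan, np.nan, np.nan,
             [7], np.nan, np.nan, np.nan, np.nan, np.nan],
})

# apply per group so a tag from one group never fills into the next
parts = [overlap_ffill(g) for _, g in df.groupby("group")["tags"]]
df["res"] = pd.concat(parts).sort_index()
```

The first row of group "b" stays untouched by group "a"'s [40], which a plain rolling window over the whole frame would not guarantee.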

Kael