2

first time poster and python beginner so hopefully asking correctly.

I need to loop through all rows in a dataframe, checking for a string match in one column. If there is a match then I want to insert a date into a new column, if not then use a different date.

I need to iterate through the rows as each time the condition is met I want to advance the date by one day.

For now I'm just trying to make sure the string match is working so I'm just inserting two different strings depending on the match result. The problem is that my condition seem to never be met. I can't see where its going wrong.

My code so far is

for index in df.index:
    if df.loc[index, 'time'] == '00:00':
        df['date'] = "wooo"
    else:
        df['date'] = "blah"

The result I'm seeing is in the image below. As you can see it is not picking up the row that matches the condition.

enter image description here

hc_dev
  • 8,389
  • 1
  • 26
  • 38
user18216948
  • 33
  • 1
  • 5

2 Answers2

1

df['date'] = "wooo" sets the entire column to "wooo", which I'm sure is not your intention. You should just set that row's date column

for index in df.index:
    if df.loc[index, 'time'] == '00:00':
        df.loc[index, 'date'] = "wooo"
    else:
        df.loc[index, 'date'] = "blah"

Better yet, use logical indexing and set multiple rows at once instead of iterating over rows and setting each value individually:

df['date'] = 'blah'                 # Set everything to blah
zero_rows = df['time'] == '00:00'   # Find the rows where time is 00:00
df.loc[zero_rows, 'date'] = 'wooo'  # Set these rows to wooo

For example, with

df = pd.DataFrame([['00:00', 2, 3], ['00:01', 5, 6], ['00:02', 8, 9], ['00:00', 10, 11]], columns=['time', 'b', 'c'])

we get:

    time   b   c  date
0  00:00   2   3  wooo
1  00:01   5   6  blah
2  00:02   8   9  blah
3  00:00  10  11  wooo

Re. your comment:

The issue with the logical indexing is that eventually I’m trying to assign a date which counts forwards at the start of each new day (time = 00:00), so until a 00:00 is found insert todays date in each row. Once found, insert today+1 and continue until found again and then insert today+2 etc. The number of rows until the first 00:00 is found varies depending on when the dataframe is created.

This is easily implemented. Let's start with a longer dataframe:

df = pd.DataFrame([['00:00', 2, 3], ['12:00', 5, 6], ['00:00', 8, 9], ['12:00', 10, 11], ['00:00', 18, 19], ['12:00', 110, 111], ['00:00', 28, 29], ['12:00', 210, 211]], columns=['time', 'b', 'c'])

Let's pull out all the columns with time == '00:00'.

zero_rows = df['time'] == '00:00'
midnights = df.loc[zero_rows, :]

    time   b   c
0  00:00   2   3
2  00:00   8   9
4  00:00  18  19
6  00:00  28  29

Now, create a new column 'increment' and fill it with nan. Then, fill only the midnight values with numbers:

if df.iloc[0]['time'] == '00:00':
    start_increment = 0 # First row is midnight, so we start with an increment of zero
else:
    start_increment = 1 # First row is not midnight, so the first midnight row has an increment of 1

df['increment'] = pd.NA
df.loc[midnights.index, 'increment'] = range(start_increment, len(midnights)+start_increment)

    time    b    c increment
0  00:00    2    3         0
1  12:00    5    6      <NA>
2  00:00    8    9         1
3  12:00   10   11      <NA>
4  00:00   18   19         2
5  12:00  110  111      <NA>
6  00:00   28   29         3
7  12:00  210  211      <NA>

Next, we can use ffill() to fill values in the increment column that are NA.

df['increment'].ffill(inplace=True)

    time    b    c  increment
0  00:00    2    3          0
1  12:00    5    6          0
2  00:00    8    9          1
3  12:00   10   11          1
4  00:00   18   19          2
5  12:00  110  111          2
6  00:00   28   29          3
7  12:00  210  211          3

Any NAs that still remain were before the first midnight, so they should be zeros:

df['increment'].fillna(0, inplace=True)

Now, we want to add a column 'date' which has a value of today + increment days. To do this, we first convert the increment column to pd.TimeDelta:

df['increment'] = pd.to_timedelta(df['increment'], unit='D')

    time    b    c increment
0  00:00    2    3    0 days
1  12:00    5    6    0 days
2  00:00    8    9    1 days
3  12:00   10   11    1 days
4  00:00   18   19    2 days
5  12:00  110  111    2 days
6  00:00   28   29    3 days
7  12:00  210  211    3 days

Finally, add today's date to the increment column:

import datetime
df['date'] = datetime.datetime.today() + df['increment']

    time    b    c increment       date
0  00:00    2    3    0 days 2022-02-15
1  12:00    5    6    0 days 2022-02-15
2  00:00    8    9    1 days 2022-02-16
3  12:00   10   11    1 days 2022-02-16
4  00:00   18   19    2 days 2022-02-17
5  12:00  110  111    2 days 2022-02-17
6  00:00   28   29    3 days 2022-02-18
7  12:00  210  211    3 days 2022-02-18
Pranav Hosangadi
  • 23,755
  • 7
  • 44
  • 70
  • Thank you, I will try your first suggestion tomorrow. The issue with the logical indexing is that eventually I’m trying to assign a date which counts forwards at the start of each new day (time = 00:00), so until a 00:00 is found insert todays date in each row. Once found, insert today+1 and continue until found again and then insert today+2 etc. The number of rows until the first 00:00 is found varies depending on when the dataframe is created. – user18216948 Feb 15 '22 at 21:54
  • I've tried your solution and no luck, it seems that it just cannot match the string value. I have attempted to cast the column as a string to make sure it CAN match but it doesn't seem to want to. The dataframe is created by scraping data off a public URL. Would it help if I shared the code to get the data so you can see it as I do? – user18216948 Feb 15 '22 at 22:17
  • I looked into why the match wasn't working. I tried to copy and past the string in directly from the data to the coded but no luck. I then wondered if the were hidden characters and so did a .strip() on the time column. This has fixed it and your solution adding the .loc to my original code worked. Thank you for helping out a newbie. Is there a way to give you +ve feedback on your profile? – user18216948 Feb 15 '22 at 22:27
  • 1
    @user18216948 great! I'm glad this helped. Good job debugging the issue. Re. "insert today's date and increment every time you see a `"00:00"`: I'm writing an edit to my answer that will show you how to do this without iterating over all rows. – Pranav Hosangadi Feb 15 '22 at 22:52
  • 1
    @user18216948 take a look at the updated answer now. – Pranav Hosangadi Feb 15 '22 at 23:14
  • Just tried the revised solution. Awesome, exactly what I needed. Thank you! – user18216948 Feb 16 '22 at 18:45
0

You can use apply with a Pythonic ternary-expression inside a lambda.

Try in the Python shell:

>>> import pandas as pd
>>> df = pd.DataFrame([['00:00', 2, 3], ['00:01', 5, 6], ['00:02', 8, 9], ['00:00', 10, 11]], columns=['time', 'b', 'c'])
>>> df['date'] = df.time.apply(lambda t: 'wooo' if t == '00:00' else 'blah')
>>> df
    time   b   c  date
0  00:00   2   3  wooo
1  00:01   5   6  blah
2  00:02   8   9  blah
3  00:00  10  11  wooo
hc_dev
  • 8,389
  • 1
  • 26
  • 38
  • Thank you but I need to iterate through line by line, not apply in one go. The other answer in the thread solves my issue but I thank you for your reply. – user18216948 Feb 15 '22 at 22:32
  • @user18216948 The match-based date-insertion you asked for is done. For the rolling-day increment it would help to clarify in what relation the `datetime collected` and `time` columns are. Or where you got this data from (the API or URL-query). – hc_dev Feb 15 '22 at 22:45