1

I have a .txt file that looks as follows:

Name 1
                                                           
                                                           
   @Name( )                  Value       WATER       WHP     
                  Date       Unit         Unit       Unit     
-------------- ---------- ---------- ---------- ---------- 
Name 1       20081220      2900.00       0.00    3300.00 
Name 1       20081221      0.00          0.00    3390.00 
Name 1       20081222      2500.00       0.00    2802.00 
Name 1       20081223      0.00          0.00    3022.00
Name 1       20081224      0.00          0.00    3022.00

I used the following code to import into python:

import pandas as pd

df = pd.read_csv(r'test_prd.txt', skiprows=6, engine="python", header=None)
df.columns = ['Test']
df.drop(df.tail(1).index, inplace=True)  # drop the trailing line caused by the file format
df = df.Test.str.split(expand=True)

df.rename(columns={0: 'Name', 1: 'Number', 2: 'Date', 3: 'Value', 4: 'Water', 5: 'WHP'},
          inplace=True)
df['Date'] = pd.to_datetime(df['Date']).dt.floor('D').dt.strftime('%Y-%m-%d')
df['Note'] = df['Value'].apply(lambda x: 'Yes' if x == 0 else '')
del df['Water']
del df['WHP']

df['Name'] = df['Name'].astype(str) + ' ' + df['Number'].astype(str)
del df['Number']

After using this code the data frame looks like:

           Name      Date       Value       Note
    0     Name 1  2008-12-20    2900.00      
    1     Name 1  2008-12-21    0.00         Yes
    2     Name 1  2008-12-22    2500.00      
    3     Name 1  2008-12-23    0.00         Yes
    4     Name 1  2008-12-24    0.00         Yes 
   ...    ...        ...        ...          ...
    78    Name    2009-03-15    0.00         Yes
    79    Name    2009-03-16    3000.00      
    80    Name    2009-03-17    0.00         Yes
   ...    ...        ...        ...          ...

I want to print the periods of time (start date - end date) for which the 'Value' column equals zero, i.e., when 'Note' = 'Yes'. Any other row where the value is non-zero can be removed from the data frame. If there is a standalone zero value (preceded and followed by non-zero values), the start and end date would be the same.

The expected output should look like this:

      Name     Start Date      End Date     Value       Note
1     Name     2008-12-21    2008-12-21     0.00         Yes
2     Name     2008-12-23    2009-03-15     0.00         Yes
3     Name     2009-03-17        ***        0.00         Yes
       ...        ...           ...         ...          ...

I was trying to use a conditional if statement or df.loc, but I don't know my way around Python well enough to put it together. Any advice would be appreciated.

kn2298

2 Answers

1

First, let's import your dataframe using read_csv:

df = pd.read_csv("yourfile.txt", sep=r"\s+", engine="python", parse_dates=["Date"])
df["Value"] = df["Value"].astype(float)

Be sure to replace the value of sep with the correct column separator. Here I assume the separator is one or more whitespace characters; if not, please adapt it.
Also be sure to convert the "Date" column to datetime (that is what the parse_dates parameter does) and to make sure the "Value" column is of type float.
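
If the one-liner above does not match your file exactly (the file in the question has a title line, blank lines and a two-row header before the data), a sketch of an alternative import is below. The file name, the number of rows to skip and the column names are assumptions taken from the question and may need adjusting for your file.

import pandas as pd

df = pd.read_csv(
    "test_prd.txt",
    sep=r"\s+",                 # one or more whitespace characters
    engine="python",
    skiprows=6,                 # title line, blank lines, two header rows, dashed line
    header=None,
    names=["Name", "Number", "Date", "Value", "Water", "WHP"],
    parse_dates=["Date"],       # "20081220" -> 2008-12-20
)
df["Value"] = df["Value"].astype(float)
df["Name"] = df["Name"] + " " + df["Number"].astype(str)  # rebuild "Name 1"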

Now with df being your dataframe, this snippet should do what you want.

df["Start"] = (df["Value"] == 0) & (df["Value"].shift(1) != 0)
ddf = df[df["Value"] == 0]
ddf["Group"] = ddf["Start"].cumsum()
rdf = ddf.groupby("Group").apply(lambda x: PD.Series({"Name":x["Name"].iloc[0],
                                                      "Start Date":x["Date"].min(),
                                                      "End Date":x["Date"].max(),
                                                      "Value": 0.,
                                                      "Note": "Yes",
                                                     })).reset_index(drop=True)

The point here is to use pandas functions to achieve what you want efficiently. Don't use loops: if your dataframe is big, a loop will take a long time to run.

  1. Here I first create a "Start" column that marks which rows begin a run of zero values. I do it by shifting the "Value" column forward by one position and comparing it with the original: "Start" is True for each row where a new interval should begin.
  2. Then I remove the rows whose "Value" is non-zero.
  3. Then I use cumsum to sum over the "Start" column. This produces a new "Group" column that I can use to group together the rows belonging to the same interval.
  4. Eventually I use groupby with apply to collapse the groups: for each group I create a single row of a new dataframe, taking the earliest and latest date from the "Date" column. (The sketch after this list shows the intermediate columns on the posted rows.)
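
To make these steps concrete, here is a small self-contained sketch, using only the five rows posted in the question, that prints the intermediate "Start" and "Group" columns:

import pandas as pd

# Toy frame built from the rows posted in the question
df = pd.DataFrame({
    "Name": ["Name 1"] * 5,
    "Date": pd.to_datetime(["2008-12-20", "2008-12-21", "2008-12-22",
                            "2008-12-23", "2008-12-24"]),
    "Value": [2900.0, 0.0, 2500.0, 0.0, 0.0],
})

# Step 1: a zero row starts an interval if the previous row was non-zero
df["Start"] = (df["Value"] == 0) & (df["Value"].shift(1) != 0)
# Step 2: keep only the zero rows
ddf = df[df["Value"] == 0].copy()
# Step 3: cumulative sum of "Start" labels each zero interval
ddf["Group"] = ddf["Start"].cumsum()
print(ddf[["Date", "Value", "Start", "Group"]])
# Prints (roughly):
#         Date  Value  Start  Group
# 1 2008-12-21    0.0   True      1
# 3 2008-12-23    0.0   True      2
# 4 2008-12-24    0.0  False      2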

With the rows you have posted, the final result is:

   Name Start Date   End Date  Value Note
0  Name 2008-12-21 2008-12-21    0.0  Yes
1  Name 2008-12-23 2009-03-15    0.0  Yes
2  Name 2009-03-17 2009-03-17    0.0  Yes
Valentino
  • Hi @Valentino, thank you for your feedback. I tried your sample code, but I am getting an empty data frame? The output is returning a data frame the same size as the original one, with no columns. – kn2298 Feb 02 '21 at 22:35
  • @knorr976 That's strange, I tested my code and it works for me. `rdf` is the dataframe I posted. How are you importing your data in python / pandas? – Valentino Feb 02 '21 at 23:13
  • very odd. I imported using pd.read_csv() with a .txt file. I then split the columns using str.split(). – kn2298 Feb 03 '21 at 15:53
  • @knorr976 I also used `read_csv`. I've edited my answer explaining how I import the dataframe. Could you try again? – Valentino Feb 03 '21 at 17:51
  • still no luck. I dropped a few columns that were initially in the data frame but unnecessary in the end. I used del df[] for that. Could that have had an effect? – kn2298 Feb 03 '21 at 21:48
  • @knorr976 No, unless you delete one of the column needed by the code. Could you add to your question some lines of the txt file and the code to import it in pandas exactly? I still think the problem may be there. – Valentino Feb 03 '21 at 22:19
  • see edits. Thanks again for helping with this. – kn2298 Feb 04 '21 at 18:02
  • @knorr976 I see your problem now: the Value column is of type string, and you need to convert it to a float. For some reason in your case it is not converted automatically. You need to add a line like `df["Value"] = df["Value"].astype(float)` to your initial manipulation of the txt file when creating the dataframe. I guess it's the `.00` decimal format: `read_csv` fails to recognize such a string as a number and does not do the conversion, so you have to force it explicitly. – Valentino Feb 04 '21 at 19:21
  • that solved everything. Thank you so much again!! – kn2298 Feb 04 '21 at 22:38
0

To select all rows of the frame for which the 'Note' column is Yes, use:

df.loc[df['Note'] == 'Yes']

This yields a frame containing only the rows that satisfy the above condition.
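
For example, on a small frame shaped like the one in the question (column names taken from there), the selection would look like this:

import pandas as pd

# Toy frame with the same column names as in the question
df = pd.DataFrame({
    "Date": ["2008-12-20", "2008-12-21", "2008-12-22"],
    "Value": [2900.0, 0.0, 2500.0],
    "Note": ["", "Yes", ""],
})

zero_rows = df.loc[df["Note"] == "Yes"]
print(zero_rows)
#          Date  Value Note
# 1  2008-12-21    0.0  Yes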

For more useful operations see: How to select rows from a DataFrame based on column values

  • Thank you for your advice! But unfortunately, this doesn't solve my question. I need the periods of time where this occurs, i.e. the first and last date at which 'Yes' occurs. – kn2298 Feb 01 '21 at 23:17