0

Overview

I'd like to return the value of a column in a csv based upon a search. I'm most of the way there, but do not understand why I'm getting the result I am, and how to fix it up. I'm using pandas 0.25 to read in the csv. This version was chosen rather than 1.03 due to some issues with 1.03 and geopandas interaction later on.

The question: Why am I receiving a row number, how do I omit the row number from the returned result?

A subset of my code is as follows:

import numpy as np
import os
import pandas as pd

code_ITS_path = 'input_ITS.csv'
code_ITS = pd.read_csv(code_ITS_path,sep=',')

for trial_name in code_ITS.Trial_Name[code_ITS.Trial_Name.str.contains("Trial")]:
   scenario_name = code_ITS.Scenario_Name[code_ITS.Trial_Name.str.contains(trial_name)]
   print("Trial Name is " + trial_name)
   print("Scenario Name is " + scenario_name)

 
 

Problem

The results that are being returned appear to include the row number before "Scenario Name is" Unfortunately this is causing me issues later on. So, printing out trial_name works, but printing out scenario_name shows the problem. The issue here is the presence of "24 " before the scenario_name. eg:

Trial Name is 500_10_3_Trial_24
24    Scenario Name is 500_10_3_24

 
 

Desired Outcome

What I need to return: Ie, scenario_name does not contain the line number and tab indent

Scenario Name is 500_10_3_25

 
 

Example Data

An example of the CSV:

Agency,Region,Requesting_Group,Scenario_Name,Type,Scenario_Description,Trial_Name,Notes
15,Main,,500_10_3_B,Operational,Testing,500_10_3_BestEstimate,None
15,Main,,500_10_3_1,Operational,Testing,500_10_3_Trial_1,None
15,Main,,500_10_3_2,Operational,Testing,500_10_3_Trial_2,None
15,Main,,500_10_3_3,Operational,Testing,500_10_3_Trial_3,None
15,Main,,500_10_3_4,Operational,Testing,500_10_3_Trial_4,None
15,Main,,500_10_3_5,Operational,Testing,500_10_3_Trial_5,None
15,Main,,500_10_3_6,Operational,Testing,500_10_3_Trial_6,None
15,Main,,500_10_3_7,Operational,Testing,500_10_3_Trial_7,None
15,Main,,500_10_3_8,Operational,Testing,500_10_3_Trial_8,None
15,Main,,500_10_3_9,Operational,Testing,500_10_3_Trial_9,None
15,Main,,500_10_3_10,Operational,Testing,500_10_3_Trial_10,None
15,Main,,500_10_3_11,Operational,Testing,500_10_3_Trial_11,None
15,Main,,500_10_3_12,Operational,Testing,500_10_3_Trial_12,None
15,Main,,500_10_3_13,Operational,Testing,500_10_3_Trial_13,None
15,Main,,500_10_3_14,Operational,Testing,500_10_3_Trial_14,None
15,Main,,500_10_3_15,Operational,Testing,500_10_3_Trial_15,None
15,Main,,500_10_3_16,Operational,Testing,500_10_3_Trial_16,None
15,Main,,500_10_3_17,Operational,Testing,500_10_3_Trial_17,None
15,Main,,500_10_3_18,Operational,Testing,500_10_3_Trial_18,None
15,Main,,500_10_3_19,Operational,Testing,500_10_3_Trial_19,None
15,Main,,500_10_3_20,Operational,Testing,500_10_3_Trial_20,None
15,Main,,500_10_3_21,Operational,Testing,500_10_3_Trial_21,None
15,Main,,500_10_3_22,Operational,Testing,500_10_3_Trial_22,None
15,Main,,500_10_3_23,Operational,Testing,500_10_3_Trial_23,None
15,Main,,500_10_3_24,Operational,Testing,500_10_3_Trial_24,None
15,Main,,500_10_3_25,Operational,Testing,500_10_3_Trial_25,None
anakaine
  • 1,188
  • 2
  • 14
  • 30
  • `for variable in series` iterates over the row numbers. – Barmar Apr 08 '20 at 21:49
  • Ok, I get that part. Can I coerce it not to show up on the result? The for trial_name component does not show the line number, but scenario_name does and the method is quite similar. It's causing issues doesn the line when I try to set a dataframe column to that value, and also it means my outputs won't match the inputs textually. – anakaine Apr 08 '20 at 21:52
  • @Barmar Do you know if it's version-dependent? I'm using 1.0.3, and I don't get the index when iterating. – AMC Apr 08 '20 at 22:09
  • Please provide a [mcve]. By the way, instead of `code_ITS.Trial_Name[code_ITS.Trial_Name.str.contains("Trial")]` it's probably best to use `code_ITS.loc[code_ITS["Trial_Name"].str.contains("Trial"), "Trial_Name"]`. – AMC Apr 08 '20 at 22:10
  • I guess I was wrong about that. – Barmar Apr 08 '20 at 22:15
  • @AMC why the change to using .loc? – anakaine Apr 08 '20 at 22:17
  • Also, @AMC, I've updated the first code block in the question. Save that out as a .py, and save out the example data as Input_ITS.csv into the same directory and it should run. – anakaine Apr 08 '20 at 22:18
  • @AMC, I tried changing the scenario name code to also use .loc, and now it's also returning a second line that includes dtype. `scenario_name = code_ITS.loc[code_ITS["Trial_Name"].str.contains(trial_name), "Scenario_Name"]` = `20 Scenario Name is 500_10_3_20 Name: Scenario_Name, dtype: object` – anakaine Apr 08 '20 at 22:33
  • @anakaine The whole dtype thing is to be expected, I must leave for a few minutes, I will try to figure out why the index is being printed. – AMC Apr 08 '20 at 22:54
  • Thanks very much – anakaine Apr 08 '20 at 23:06
  • _This version was chosen rather than 1.03 due to some issues with 1.03 and geopandas interaction later on._ I hadn't heard of that, do you know where I can find more on the subject? – AMC Apr 09 '20 at 00:22
  • @AMC I believe this is the appropriate thread. Actually, my issue wasn't geopandas (that was something else), but an error/omission in the 1.0.2 / 1.0.3 release. https://github.com/pandas-dev/pandas/issues/32857 – anakaine Apr 09 '20 at 02:41

2 Answers2

0

The issue is the code_ITS.Scenario_Name[code_ITS.Trial_Name.str.contains(trial_name)] returns a Series so printing it also show the row number. If you want to discard the row number you can add .values i.e.

code_ITS.Scenario_Name[code_ITS.Trial_Name.str.contains(trial_name)].values

which returns a numpy.array so it wont show the row number. The output will than look like this:

Trial Name is 500_10_3_Trial_1
['Scenario Name is 500_10_3_1']
Trial Name is 500_10_3_Trial_2
['Scenario Name is 500_10_3_2']
Trial Name is 500_10_3_Trial_3
['Scenario Name is 500_10_3_3']
Trial Name is 500_10_3_Trial_4
['Scenario Name is 500_10_3_4']
Trial Name is 500_10_3_Trial_5
['Scenario Name is 500_10_3_5']
Trial Name is 500_10_3_Trial_6
['Scenario Name is 500_10_3_6']
Trial Name is 500_10_3_Trial_7
['Scenario Name is 500_10_3_7']

if you want to get rid of the brackets add [0] at the end. You'll need to be a bit more careful since you may end up with an empty array so you'll need to check it as well.

The following code

import numpy as np
import os
import pandas as pd

code_ITS_path = 'input_ITS.csv'
code_ITS = pd.read_csv(code_ITS_path,sep=',')

for trial_name in code_ITS.Trial_Name[code_ITS.Trial_Name.str.contains("Trial")]:
    try:
        scenario_name = code_ITS.Scenario_Name[code_ITS.Trial_Name.str.contains(trial_name)].values[0]
        print("Trial Name is " + trial_name)
        print("Scenario Name is " + scenario_name)
    except Exception as e:
        print(e)

gives the output:

Trial Name is 500_10_3_Trial_1
Scenario Name is 500_10_3_1
Trial Name is 500_10_3_Trial_2
Scenario Name is 500_10_3_2
Trial Name is 500_10_3_Trial_3
Scenario Name is 500_10_3_3
Trial Name is 500_10_3_Trial_4
Scenario Name is 500_10_3_4
Trial Name is 500_10_3_Trial_5
Scenario Name is 500_10_3_5
Trial Name is 500_10_3_Trial_6
Scenario Name is 500_10_3_6
Trial Name is 500_10_3_Trial_7
Scenario Name is 500_10_3_7
  • Thank you very much for the explanation of returning the np array, and then showing how to get the value. Also, thanks for being attentive enough to show how to catch the error. As someone who is still learning that's a huge help and I really appreciate it. I've marked the answer as correct, too. – anakaine Apr 08 '20 at 23:56
  • _If you want to discard the row number you can add `.values` [...] which returns a numpy.array so it wont show the row number._ The use of `.values` is discouraged, see [the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html). _if you want to get rid of the brackets add `[0]` at the end._ In which case there is no need to convert the Series to anything. – AMC Apr 09 '20 at 00:20
  • @anakaine _Also, thanks for being attentive enough to show how to catch the error._ Be very careful, that is not a good way of dealing with exceptions. See, for example: https://stackoverflow.com/questions/54948548/what-is-wrong-with-using-a-bare-except. – AMC Apr 09 '20 at 00:21
  • You've given me a fair bit to go through. Thanks – anakaine Apr 09 '20 at 02:41
0

scenario_name is a Series, which is why other things like the index and name are printed.

Instead of getting the column and then indexing that, you can just use loc or iloc:

import pandas as pd

code_ITS = pd.read_csv("../resources/test.csv")

for trial_name in code_ITS.loc[code_ITS["Trial_Name"].str.contains("Trial"), "Trial_Name"]:
    scenario_name = code_ITS.loc[code_ITS["Trial_Name"].str.contains(trial_name), "Scenario_Name"]
    print(f"\nTrial Name is:\n{trial_name}\nScenario Name is:\n{scenario_name}")

The first few lines of output:

Trial Name is:
500_10_3_Trial_1
Scenario Name is:
1      500_10_3_1
10    500_10_3_10
11    500_10_3_11
12    500_10_3_12
13    500_10_3_13
14    500_10_3_14
15    500_10_3_15
16    500_10_3_16
17    500_10_3_17
18    500_10_3_18
19    500_10_3_19
Name: Scenario_Name, dtype: object

Trial Name is:
500_10_3_Trial_2
Scenario Name is:
2      500_10_3_2
20    500_10_3_20
21    500_10_3_21
22    500_10_3_22
23    500_10_3_23
24    500_10_3_24
25    500_10_3_25
Name: Scenario_Name, dtype: object

...

You can find more information on the differences between loc and standard indexing here.

Notice that I use [ ] to access columns, I find it far safer and more convenient than the ./attribute style.


I think it's also worth mentionning that the way you're using Series.str.contains() might lead to some problems. This example should help illustrate why:

import pandas as pd

df = pd.DataFrame({"col_1": ["training", "restrain"]})

print(df, end="\n\n")

print(df.loc[df["col_1"].str.contains("train"), "col_1"])

Output:

      col_1
0  training
1  restrain

0    training
1    restrain
Name: col_1, dtype: object

The fact that your data follows a rather clear pattern is a great thing, you should be able to write a solid regex to use instead.

AMC
  • 2,642
  • 7
  • 13
  • 35