
I am appending three CSVs:


import pandas as pd

df = pd.read_csv("places_1.csv")
temp = pd.read_csv("places_2.csv")
df = df.append(temp)
temp = pd.read_csv("places_3.csv")
df = df.append(temp)
print(df.head(20))

the joined table looks like:

  location  device_count  population
0        A            11         NaN
1        B            12         NaN
2        C            13         NaN
3        D            14         NaN
4        E            15         NaN
0        F            21         NaN
1        G            22         NaN
2        H            23         NaN
3        I            24         NaN
4        J            25         NaN
0        K            31         NaN
1        L            32         NaN
2        M            33         NaN
3        N            34         NaN
4        O            35         NaN

As you can see the indices are not unique.

When I run this loop, which uses iloc to set population to twice device_count:

df2 = df.copy()
for index, row in df.iterrows():
    df.iloc[index, df.columns.get_loc('population')] = row['device_count'] * 2

I get the following erroneous result:

  location  device_count  population
0        A            11        62.0
1        B            12        64.0
2        C            13        66.0
3        D            14        68.0
4        E            15        70.0
0        F            21         NaN
1        G            22         NaN
2        H            23         NaN
3        I            24         NaN
4        J            25         NaN
0        K            31         NaN
1        L            32         NaN
2        M            33         NaN
3        N            34         NaN
4        O            35         NaN

Because the indices repeat, the loop keeps overwriting the rows that came from the first CSV. I have also tried creating a new column of integers and calling df.set_index(), but that did not work either.

Any tips?

Steve Scott

2 Answers


First, pass ignore_index=True so the combined frame gets a fresh, unique index. Second, don't use append (it has since been deprecated and removed from pandas); build the frame in one step with pd.concat([temp1, temp2, temp3], ignore_index=True).
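Applied to the question's three CSVs, the concat version might look like this (the in-memory frames below stand in for the CSV contents, so the sketch runs without the files):

```python
import pandas as pd

# Stand-ins for places_1.csv, places_2.csv, and places_3.csv.
frames = [
    pd.DataFrame({"location": list("ABCDE"), "device_count": range(11, 16)}),
    pd.DataFrame({"location": list("FGHIJ"), "device_count": range(21, 26)}),
    pd.DataFrame({"location": list("KLMNO"), "device_count": range(31, 36)}),
]

# ignore_index=True discards each frame's own 0..4 index and
# assigns a single 0..14 RangeIndex to the result.
df = pd.concat(frames, ignore_index=True)
print(df.index.is_unique)  # True
```

With a unique index, positional (iloc) and label (loc) lookups no longer disagree, which is exactly what broke the original loop.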

Igor Rivin

As others have stated, you can use ignore_index, and you probably should use pd.concat here. Alternatively, for other situations where you are not combining DataFrames, you can also use df = df.reset_index(drop=True) to change the indices after the fact.
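A minimal sketch of the reset_index route, using a toy frame rather than the question's data:

```python
import pandas as pd

# Two frames combined the naive way, producing duplicate indices 0 and 1.
df = pd.concat([pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3, 4]})])
print(df.index.tolist())  # [0, 1, 0, 1]

# drop=True discards the old index instead of keeping it as a new column.
df = df.reset_index(drop=True)
print(df.index.tolist())  # [0, 1, 2, 3]
```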

Additionally, you should avoid using iterrows() for the reasons listed in the docs here. The following vectorized assignment works much better:

df.loc[:, 'population'] = df.loc[:, 'device_count'].astype('int') * 2

bug_spray
  • the *2 is just a proxy for a more complex function I am running that involves multiple columns from the row. – Steve Scott May 31 '20 at 16:42
  • why use concat instead of append? – Steve Scott May 31 '20 at 16:47
  • @SteveScott It's more flexible, and in much wider use, so is likely to be more efficient. https://machinelearningknowledge.ai/tutorial-pandas-concat-pandas-append-pandas-merge-pandas-join/ – Igor Rivin May 31 '20 at 16:50
  • Strange. `df = df.reset_index(drop=True)` or `df.reset_index(drop=True, inplace=True)` should work... – bug_spray May 31 '20 at 16:55
  • If vectorization is not an option for your function, I recommend you refer to cs95's answer to this [question](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758?noredirect=1) for a performance analysis of modifying dataframes with functions. You'll notice `iterrows()` is the last tool you should reach for – bug_spray May 31 '20 at 16:59
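Picking up the first comment above (the *2 is a proxy for a multi-column function): if vectorization really is not an option, one row-wise sketch uses itertuples(), which is considerably faster than iterrows(). The combine() function here is a made-up stand-in for the asker's real logic:

```python
import pandas as pd

df = pd.concat(
    [pd.DataFrame({"location": list("AB"), "device_count": [11, 12]}),
     pd.DataFrame({"location": list("CD"), "device_count": [21, 22]})],
    ignore_index=True,
)

# Hypothetical stand-in for the asker's multi-column function.
def combine(device_count, location):
    return device_count * 2

# Build the whole column at once instead of writing cell by cell,
# so duplicate or non-integer indices can never cause mis-writes.
df["population"] = [
    combine(row.device_count, row.location) for row in df.itertuples()
]
print(df["population"].tolist())  # [22, 24, 42, 44]
```

Assigning the full column in one step also sidesteps the original bug entirely: no per-row iloc writes means no dependence on the index at all.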