0

First of all, this is not a duplicate! I have searched in several SO questions as well as the Pandas doc, and I have not found anything conclusive!To create a new column with a row value, like this and this!

Imagine I have the following table, opening an .xls and I create a dataframe with it. As this is a small example created from the real proble, I created this simple Excel table which can be easily reproduceable:

Table

What I want now is to find the row that has "Population Month Year" (I will be looking at different .xls, so the structure is the same: population, month and year.

xls='population_example.xls'
sheet_name='Sheet1'
df = pd.read_excel(xls, sheet_name=sheet_name, header=0, skiprows=2)
df

What I thought is:

  1. Get the value of that row with startswith

  2. Create a column, pythoning that value and getting the month and year value.

I have tried several things similar to this:

dff=df[s.str.startswith('Population')]
dff

But errors won't stop coming. In this above's code error, specifically:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match

I have several guesses:

  • I am not understanding properly how Seriesin pandas work, even though reading the doc. I did not even think on using them, but the startswithlooks like the thing I am looking for.
  • If I handle this properly, I might have a NaN error, but I cannot use df.dropna()yet, as I would lose that row value (Population April 2017)!

Edit:

The problem on using this:

df[df['Area'].str.startswith('Population')] Is that it will check the na values.

And this:

df['Area'].str.startswith('Population')

Will give me a true/false/na set of values, which I am not sure how I can use.

M.K
  • 1,464
  • 2
  • 24
  • 46
  • what is `s` here? I think you want `df[df['Area'].str.startswith('Population')]` – Erfan Jul 01 '19 at 15:48
  • 1
    But first of all reformat your question in a decent format, dont post [pictures of your code/data](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question). We cannot copy and paste that to reproduce an answer. And how would your expected output look like? – Erfan Jul 01 '19 at 15:50
  • 2
    Try: `df[df['Area'].str.startswith('Population'), na=False]` To fix your `NaN` problem – Erfan Jul 01 '19 at 15:52
  • I am opening a `.xls`. I will transform it into a `.csv`so you can open it! This is a small problem from the real one, which has a more complex table, etc, so that is why I did not want to bother with extra info @Erfan . omw to edit! – M.K Jul 01 '19 at 15:52
  • 1
    We don't need a csv. Just post an example dataset here, read [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for more information. – Erfan Jul 01 '19 at 15:55
  • Please have a look at [mcve](/help/mcve) – rpanai Jul 01 '19 at 15:56
  • 1
    On my way! Thanks for all the feedback! Not really used to ask about pandas @Erfan . The starts with did the trick to get me the whole row! – M.K Jul 01 '19 at 15:58
  • I posted the solution! If you want, post that same solution yourself, and I shall accept it, upvote it and delete mine as you made me get there! @Erfan – M.K Jul 01 '19 at 16:08

2 Answers2

1

Thanks to @Erfan , I got to the solution:

Using properly the line of code in the comments and not like I was trying, I managed to:

dff=df[df['Area'].str.startswith('Population', na=False)] dff

Which would output: Population and household forecasts, 2016 to 20... NaN NaN NaN NaN NaN NaN

Now I can access this value like

value=dff.iloc[0][0] value

To get the string I was looking for: 'Population and household forecasts, 2016 to 2041, prepared by .id , the population experts, April 2019.' And I can python around with this to create the desired column. Thank you!

M.K
  • 1,464
  • 2
  • 24
  • 46
  • 1
    You might want to use: `df.loc[df['Area'].str.startswith('Population', na=False), 'Area'].iat[0]` Which will give you the value as well – Erfan Jul 01 '19 at 16:15
  • That is actually much better and optimal! Thank you so much for the patience and goot attitude towards the question! @Erfan – M.K Jul 01 '19 at 16:19
1

You could try:

import pandas as pd
import numpy as np

pd.DataFrame({'Area': [f'Whatever{i+1}' for i in range(3)] + [np.nan, 'Population April 2017.'],
              'Population': [3867, 1675, 1904, np.nan, np.nan]}).to_excel('population_example.xls', index=False)

df = pd.read_excel('population_example.xls').fillna('')

population_date = df[df.Area.str.startswith('Population')].Area.values[0].lstrip('Population ').rstrip('.').split()

Result:

['April', '2017']

Or (if Population Month Year is always on the last row):

df.iloc[-1, 0].lstrip('Population ').rstrip('.').split()
René
  • 4,594
  • 5
  • 23
  • 52