1

I have the table from ImDB with actors.

enter image description here

From this table I want to drop all rows where imdb_actors.birthYear is missing or is less than 1950 and also drope those where imdb_actors.deathYear has some value.

Idea is to get a dataset with actors who are alive and not retired.

imdb_actors.birthYear.dtype
Out:dtype('O')

And I can't convert to string, this doesn't help: imdb_actors['birthYear'] = imdb_actors['birthYear'].astype('|S'). It just ruins all years.

That's why I can't execute: imdb_actors[imdb_actors.birthYear >= 1955] When I try imdb_actors.birthYear.astype(str).astype(int) I get the message: ValueError: invalid literal for int() with base 10: '\\N'

What will be the way to drop missing and apply >= 1950 condition?

Pinky the mouse
  • 197
  • 1
  • 4
  • 10

2 Answers2

2

First convert numeric data to numeric series:

num_cols = ['birthYear', 'deathYear']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')

Specifying errors='coerce' forces non-convertible elements to NaN.

Then create masks for your 3 conditions, combine via the vectorised | "or" operator, negate via ~, and apply Boolean indexing on your dataframe:

m1 = df['birthYear'].isnull()
m2 = df['birthYear'] < 1950
m3 = df['deathYear'].notnull()

res = df[~(m1 | m2 | m3)]
jpp
  • 159,742
  • 34
  • 281
  • 339
  • df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce') loses good cells with years and just gives me back each column as NaN. – Pinky the mouse Sep 05 '18 at 15:45
  • I'd love to check this out, but I'm having trouble copy-pasting the image you posted on your question into my code. See also [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – jpp Sep 05 '18 at 15:48
0

Your problem is that the type of your birthYear serie is Object which would be for strings or a mix of types.

You will want to clean this serie first by applying a function like this :

imdb_actors.birthYear = imdb_actors.birthYear.map(lambda x: int(x) if str(x) != '\\N' else pd.np.nan)

then you can do your filtering:

imdb_actors[imdb_actors.birthYear >= 1955]
Alex
  • 816
  • 5
  • 14
  • Trying to run the first line I get the error: ValueError: invalid literal for int() with base 10: b'\\N' – Pinky the mouse Sep 05 '18 at 15:33
  • could you post a sample of your data ? – Alex Sep 05 '18 at 15:37
  • You should use vectorised operations. `map` + `lambda` on `object` dtype is no better than a simple loop with a `list`. – jpp Sep 05 '18 at 15:39
  • I have a great trouble converting tables to a nice format on SO. Is there a way to paste it nicely? Sorry for silly question. – Pinky the mouse Sep 05 '18 at 15:43
  • @Pinkythemouse, That's a good question. See this question for answers: [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – jpp Sep 05 '18 at 15:44