0

Please help me to figure out how to do it. I have a dataframe. In the column "Indicator" there are a bunch of different parameters (strings) but I need only "Life satisfaction". I do not know how to delete other indicators like "Dwellings without basic facilities" and their corresponding Values and Countries.

import numpy as np
import pandas as pd

oecd_bli = pd.read_csv("/Users/vladelec/Desktop/Life.csv")
df = pd.DataFrame(oecd_bli)
df.drop(df.columns[[0,2,4,5,6,7,8,9,10,11,12,13,15,16]], axis=1, inplace=True) 
#dropped other columns that a do not need

Here is a screenshot of my dataframe:

Example of Dataframe

  • you dont need to do `oecd_bli = pd.read_csv("/Users/vladelec/Desktop/Life.csv") df = pd.DataFrame(oecd_bli)` only the first line. – foxyblue Sep 14 '17 at 23:10
  • Possible duplicate of [Deleting DataFrame row in Pandas based on column value](https://stackoverflow.com/questions/18172851/deleting-dataframe-row-in-pandas-based-on-column-value) – foxyblue Sep 14 '17 at 23:23

2 Answers2

1

You can load in your data like so:

file_name = "/Users/vladelec/Desktop/Life.csv"

# Columns you want to load
keep_cols = ['Country', 'Indicator']

# pd.read_csv() will load the data into a pd.DataFrame
oecd_bli = pd.read_csv(file_name, usecols=keep_cols)

If you only want "Life Satisfaction" from Indicator then you can do the following:

oecd_bli = oecd_bli[oecd_bli['Indicator'] == "Life Satisfaction"]

If you have more Indicators you would like to keep, then you can do this:

keep_indicators = [
    "Life Satisfaction",
    "Crime Indicator",
]

oecd_bli = oecd_bli[oecd_bli['Indicator'].isin(keep_indicators)]
foxyblue
  • 2,859
  • 2
  • 21
  • 29
0

@GiantsLoveDeathMetal has good points. In principle, you can read the raw data in as oecd_bli and select subsets of the DataFrame that satisfy certain conditions.

Demo

import pandas as pd


# Given a DataFrame of raw data
d = {
    "Country": pd.Series(["Australia", "Austria", "Fiji", "Japan"]),
    "Indicator": pd.Series(["Dwellings ...", "Dwellings ...", "Life ...", "Life ..."]),
    "Value": pd.Series([1.1, 1.0, 2.2, 2.9]),
}

oecd_bli = pd.DataFrame(d, columns=["Country", "Indicator", "Value"] )
oecd_bli

enter image description here

# Select rows starting with "Life" in column "Indicator"
condition = oecd_bli["Indicator"].str.startswith("Life")
subset = oecd_bli[condition]
subset

enter image description here

Alternatively, select a subset using label indexing via .loc:

subset = oecd_bli.loc[condition, :]

Here loc expects [<rows>, <columns>]. Thus, those rows that meet the condition are displayed.


Details

Notice a view of the DataFrame is presented for every row that gives a True condition. This is because the DataFrame responds to the boolean arrays.

Example of a boolean array:

>>> condition = oecd_bli["Indicator"].str.startswith("Life")
>>> condition

0    False
1    False
2     True
3     True
Name: Indicator, dtype: bool

Other ways to setup conditions:

>>> condition = oecd_bli["Indicator"] == "Life ..."
>>> condition = ~oecd_bli["Indicator"].str.startswith("Dwell")
>>> condition = oecd_bli["Indicator"].isin(["Life ...", "Crime ..."])
>>> condition = (oecd_bli["Indicator"] == "Life ...") | (oecd_bli["Indicator"] == "Crime ...") 
  1. direct equality (==)
  2. exclude (~) undesired occurrences
  3. include whitelisted columns via isin
  4. multiple comparisons with logical bit operators (|, &, etc.)
pylang
  • 40,867
  • 14
  • 129
  • 121