2

How would I go about this in Python Pandas? Would I use groupby for question 2? I don't want an answer in code; pseudocode or an explanation of the operations would be fine.

Dataset 1
CITY    POPULATION
BOSTON   645,966
NEW YORK 8,336,697
CHICAGO  2,714,856

Dataset 2
Newspaper          City           Readers
Boston Globe       Boston, MA     245572
New York Times     New York, NY   1865318
Daily News         New York, NY   516165
New York Post      New York, NY   500521
Chicago Sun-Times  Chicago, IL    470548
Chicago Tribune    Chicago, IL    414930

1. List the operations (in order) to modify each value in the ‘City’ attribute in Dataset 2 so that it can be directly compared to the ‘CITY’ attribute in Dataset 1.

2. Assume each newspaper reader reads one paper and it is from their home city. List the operation(s) to calculate the total number of newspaper readers in each city.

Janeson00

2 Answers

1

You could take the unique values of the City column in dataset2 and then loop over them, conditionally replacing the matching rows in your dataframe. Simply put:

    # Get the unique city names from the 'City' column
    city_list = dataset2['City'].unique().tolist()
    # Map each Dataset 2 name to the matching Dataset 1 name
    city_mapping = {
        'Boston, MA': 'BOSTON',
        'New York, NY': 'NEW YORK',
        'Chicago, IL': 'CHICAGO',
    }

    # Iterate over the names and replace each with its mapped value
    for city in city_list:
        dataset2.loc[dataset2['City'] == city, 'City'] = city_mapping[city]

And yes, for the second question use groupby and sum. You can improve on this a lot, which you could figure out as you go. For example, you could generate city_mapping dynamically by partially matching text from dataset2 to dataset1.
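As a rough sketch of the dynamic mapping idea, something like the following might work. It assumes each Dataset 1 name is an upper-case prefix of exactly one form of the Dataset 2 name (e.g. "BOSTON" for "Boston, MA"); the sample frames are just the data from the question, and the prefix check is only one possible way to do the partial match.

```python
import pandas as pd

# Sample data copied from the question
dataset1 = pd.DataFrame({
    "CITY": ["BOSTON", "NEW YORK", "CHICAGO"],
    "POPULATION": [645966, 8336697, 2714856],
})
dataset2 = pd.DataFrame({
    "Newspaper": ["Boston Globe", "New York Times", "Daily News",
                  "New York Post", "Chicago Sun-Times", "Chicago Tribune"],
    "City": ["Boston, MA", "New York, NY", "New York, NY",
             "New York, NY", "Chicago, IL", "Chicago, IL"],
    "Readers": [245572, 1865318, 516165, 500521, 470548, 414930],
})

# Build the mapping by partial (prefix) matching:
# upper-case the Dataset 2 name and look for a Dataset 1 name it starts with.
city_mapping = {}
for city in dataset2["City"].unique():
    matches = [c for c in dataset1["CITY"] if city.upper().startswith(c)]
    if len(matches) == 1:  # only accept unambiguous matches
        city_mapping[city] = matches[0]

# Apply the mapping; any name without a match would become NaN
dataset2["CITY"] = dataset2["City"].map(city_mapping)
```

Checking `dataset2["CITY"].isna().any()` afterwards is a cheap way to spot city names that did not match anything.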

AJS
1

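First, match the city names:

    # Map each Dataset 2 city name to the form used in Dataset 1
    city_dict = {
        'Boston, MA': 'BOSTON',
        'New York, NY': 'NEW YORK',
        'Chicago, IL': 'CHICAGO',
    }

    # Store the normalized names in a new 'CITY' column
    dataset2['CITY'] = dataset2['City'].map(city_dict)

Then group Dataset 2 by the 'CITY' column and sum the 'Readers' column.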

Here's the link to the Pandas documentation for groupby. Essentially you're doing the same thing as the first example there, except you are grouping by only one column instead of two and taking the sum instead of the mean. If you get stuck I can give you a code example; I realize you specifically asked not to have one.
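In case it helps later, here is a minimal sketch of the group-and-sum step. It assumes Dataset 2 is loaded as a DataFrame named dataset2 with the columns shown in the question; the frame below is rebuilt from that sample data so the snippet runs on its own.

```python
import pandas as pd

# Dataset 2 from the question
dataset2 = pd.DataFrame({
    "Newspaper": ["Boston Globe", "New York Times", "Daily News",
                  "New York Post", "Chicago Sun-Times", "Chicago Tribune"],
    "City": ["Boston, MA", "New York, NY", "New York, NY",
             "New York, NY", "Chicago, IL", "Chicago, IL"],
    "Readers": [245572, 1865318, 516165, 500521, 470548, 414930],
})

# Normalize the city names so they match Dataset 1
city_dict = {
    "Boston, MA": "BOSTON",
    "New York, NY": "NEW YORK",
    "Chicago, IL": "CHICAGO",
}
dataset2["CITY"] = dataset2["City"].map(city_dict)

# Group by the normalized city and add up the readers in each group
readers_per_city = dataset2.groupby("CITY")["Readers"].sum()
print(readers_per_city)
```

The result is a Series indexed by city; calling `.reset_index()` on it gives a two-column DataFrame that could then be merged with Dataset 1 on 'CITY'.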

pistolpete