0

I imported two datasets in my notebook. One which consists of the names that were given to baby's for each state in each year (and the number, [count] of it). The other one consists of the total number of people in the U.S with that name with that specific name. (see added images).

The problem here is that I appended the two datasets, so that they became one dataset, since merging and concat didn't work.

I want to create a column 'Totalcount' next to all the columns where the state is also mentioned (not NAN), so that I can compare the babynames to the national count of that specific name for each state.

For example: in 1934, 4 baby's have been named 'Anna' in Texas. The number of people named 'Anna' in the U.S were 14000 in 1934. The number of 'totalcount' should be 14000 then. This number should be 14000 for next to the other states as well. So: 5 baby's have been named 'Anna' in California, and total count would still be 14000. I have tried many things but I can't figure it out.

The dataset, above: the national count of names in each year, but without the States. Under: The count of certain babynames which are given in each state, I appended these two datasets since merging or concat didn't work.

example states dataset:

df = pd.DataFrame([['AF',1910.0, 'F', 'Mary', 14.0], ['TX',1910.0, 'M' , 'Lex', 15.0], ['CA',1920.0,'M','Tom', 14.0]], columns=['State', 'Sex','Year','Name','Count'])

example national dataset:

df2 = pd.DataFrame([['Anna','F', 2402,1910], ['Ben','M',5044,1912], ['Simon','M',39203,1910]], columns=['Name', 'Sex','Count','Year'])
df2
Dev Khadka
  • 5,142
  • 4
  • 19
  • 33
L. Koning
  • 1
  • 2
  • 1
    Hi, and welcome to Stack Overflow. I think I understand your question in general, but without more information, it's hard to answer... – Itamar Mushkin Sep 26 '19 at 11:23
  • Just to be sure, you want total count of names irrespective of the years right? – YOLO Sep 26 '19 at 11:24
  • 1
    First - can you create a minimal example that reproduces your problem? Like a `pd.DataFrame` with even only 2 or 3 rows, posted as code (or as data, but code is best) and not as image – Itamar Mushkin Sep 26 '19 at 11:24
  • 1
    With such an example, we could also understand what doesn't work in your joining process. – Itamar Mushkin Sep 26 '19 at 11:25
  • In my opinion, merging should have been the way to go: ` pd.merge(left=df, right=df2, on='Name')`. Maybe you can elaborate why it did not work for you. – Julia K Sep 26 '19 at 11:33
  • No, I want the total count of names like 'Anna' in the specific year that is in the States dataset (where number of babynames given in each state is) The year does matter. So in 1993, 4 baby's have been given the name 'Anna' in TX (texas), while there were 14000 people named 'Anna' in 1993. The 14000 'Anna's, should be in the total_count colomn, (since there were 14000 Anna's in 1993). This should work for all the other states as well. @ YOLO – L. Koning Sep 26 '19 at 11:35
  • As a new contributor, you really should read [ask]. You are supposed to give a [mcve], not a link to an non copy-pastable image. An when it comes to pandas please read [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Serge Ballesta Sep 26 '19 at 11:37
  • @JuliaKraus The code you mentioned will give a memory error, already tried it. – L. Koning Sep 26 '19 at 11:38
  • I've added example dataframes so that it becomes a bit clearer :) – L. Koning Sep 26 '19 at 11:48
  • 1
    You might want to try reading in the two dataframes using `low_memory = True` parameter in `pd.read_csv` and then merging again. See https://www.kaggle.com/kunalkotian/easily-load-train-csv-w-o-crash-save-feather-file – Julia K Sep 26 '19 at 11:57
  • @Julia K this worked! Thank you so much!! :D – L. Koning Sep 26 '19 at 14:52

0 Answers0