I have a dataset containing names, genders, and the number of people with each name. There are more than 100 text files. Each has the same columns with different counts, one file per year: 1880, 1881, ..., 2008. Here is a link to make it clearer: https://github.com/wesm/pydata-book/tree/2nd-edition/datasets/babynames How can I import all of these files and mark the rows with the appropriate year, so that the table looks like this:

YEAR   NAME  GENDER  QUANTITY
1998   Marie    F      2994  
1996   John     M      2984
1897   Molly    F       54

The main concern is how to mark each row with the appropriate year according to the filename.

Here is my code for one file, but I need to do the same for more than 100 text files:

import pandas as pd

df = pd.read_csv("yob1880.txt", header=None)
df["year"] = 1880  # add a new column holding the file's year
print(df)

1 Answer

There are two issues here:

  1. How to extract year from filename and assign to new column.
  2. How to concatenate multiple dataframes.

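You can use string slicing and pd.DataFrame.assign for the former, and pd.concat for the latter. This assumes filenames is a list of bare file names of the form yobXXXX.txt (no directory prefix), so that fn[3:7] is the four-digit year. The files have no header row, so pass header=None and supply the column names: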

df = pd.concat(pd.read_csv(fn, header=None, names=["NAME", "GENDER", "QUANTITY"])
                 .assign(YEAR=int(fn[3:7])) for fn in filenames)

Or if you wish to ignore indices:

df = pd.concat((pd.read_csv(fn, header=None, names=["NAME", "GENDER", "QUANTITY"])
                  .assign(YEAR=int(fn[3:7])) for fn in filenames),
               ignore_index=True)
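
If you do not already have a list of filenames, here is a minimal sketch of one way to build it and do the whole job in one function. The function name load_babynames, the directory argument, and the column names are my own assumptions for illustration, not something from the question. It takes the year from the file name with a regular expression instead of slicing, so it also works when the path includes a directory:

```python
from pathlib import Path
import re

import pandas as pd


def load_babynames(directory="."):
    """Read every yobXXXX.txt file in `directory` into one DataFrame.

    Hypothetical helper: the name, the argument, and the column
    labels are assumptions chosen for this example.
    """
    frames = []
    for path in sorted(Path(directory).glob("yob*.txt")):
        # Only accept names that look exactly like yob<4 digits>.txt
        match = re.fullmatch(r"yob(\d{4})\.txt", path.name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        # The files have no header row, so name the columns explicitly
        frame = pd.read_csv(path, header=None,
                            names=["NAME", "GENDER", "QUANTITY"])
        # Put the year taken from the file name in the first column
        frame.insert(0, "YEAR", int(match.group(1)))
        frames.append(frame)
    # ignore_index=True gives one continuous 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Calling df = load_babynames("babynames") would then read everything in a folder named babynames. Note that pd.concat raises a ValueError if no matching files are found, which is probably what you want here.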