How to parse many txt files with pandas and track which file each row of the table came from

Question

I have a dataset containing names, gender, and the number of people with each name. There are many text files (>100). Each has the same columns with different quantities, one file per year for 1880, 1881, ..., 2008. Here is a link to make it clearer: https://github.com/wesm/pydata-book/tree/2nd-edition/datasets/babynames How can I import all of these files and mark the rows with the appropriate year? I would like the table to look like this:

YEAR   NAME  GENDER  QUANTITY
1998   Marie    F      2994  
1996   John     M      2984
1897   Molly    F       54

The main concern is how to mark each row with the appropriate year according to the filename.

Here is my code for one file, but I need to do the same for more than 100 text files:

import pandas as pd

df = pd.read_csv("yob1880.txt", header=None)
df["year"] = 1880  # add a new column holding the file's year
print(df)

Hey, welcome to StackOverflow :) Can you write your starting code in the post? – floatingpurr, Jan 10 '19 at 11:08
The files you are referring to have no year column defined; rather, the files are named by year. – Karn Kumar, Jan 10 '19 at 11:25
In case you are able to import all the files, do you need `df["year"] = 1880` for all of them? – Karn Kumar, Jan 10 '19 at 11:28
Yes, that's the problem. I have files named yob1881.txt, yob1882.txt, and so on, but I don't have such a column in the datasets. Is it possible to add such a column according to the name of the file? – Евгений Матвийчук, Jan 10 '19 at 11:28
If the data comes from the yob1881.txt file, its "Year" column should have the value 1881; if it comes from yob1994.txt, it should have 1994 in the "Year" column, and so on. – Евгений Матвийчук, Jan 10 '19 at 11:29

jpp · Accepted Answer · 2019-01-10T11:50:11.147

There are two issues here:

How to extract year from filename and assign to new column.
How to concatenate multiple dataframes.

You can use string slicing and pd.DataFrame.assign for the former; pd.concat for the latter. Assuming your filenames are of the format yobXXXX.txt:

df = pd.concat(pd.read_csv(fn, header=None, names=["NAME", "GENDER", "QUANTITY"])
               .assign(YEAR=int(fn[3:7])) for fn in filenames)

Or if you wish to ignore indices:

df = pd.concat((pd.read_csv(fn, header=None, names=["NAME", "GENDER", "QUANTITY"])
                .assign(YEAR=int(fn[3:7])) for fn in filenames),
               ignore_index=True)
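
The one-liners above assume a `filenames` list already exists and that each entry is a bare name such as yob1880.txt, since `fn[3:7]` breaks as soon as a directory prefix is present. A minimal sketch of one way to build that list and extract the year more defensively follows. The function name `load_babynames`, the `COLUMNS` labels (taken from the table in the question), and the regular expression are illustrative assumptions, not part of the original answer.

```python
import re
from pathlib import Path

import pandas as pd

# Column labels are an assumption based on the table in the question;
# the files themselves have no header row.
COLUMNS = ["NAME", "GENDER", "QUANTITY"]

# Matches names such as yob1880.txt and captures the four-digit year.
YEAR_PATTERN = re.compile(r"yob(\d{4})\.txt$")


def load_babynames(directory):
    """Read every yobXXXX.txt file in `directory` into one DataFrame."""
    frames = []
    for path in sorted(Path(directory).glob("yob*.txt")):
        match = YEAR_PATTERN.search(path.name)
        if match is None:
            continue  # skip files that do not follow the yobXXXX.txt pattern
        frame = pd.read_csv(path, header=None, names=COLUMNS)
        # Tag every row with the year taken from the filename.
        frames.append(frame.assign(YEAR=int(match.group(1))))
    if not frames:
        raise FileNotFoundError(f"no yobXXXX.txt files found in {directory!r}")
    combined = pd.concat(frames, ignore_index=True)
    # Put YEAR first, as in the table shown in the question.
    return combined[["YEAR"] + COLUMNS]
```

Calling `load_babynames("babynames")`, where "babynames" stands for whatever folder holds the downloaded files, should return a single table with YEAR, NAME, GENDER, and QUANTITY columns. Matching on `path.name` rather than slicing the full path keeps the year extraction independent of where the files live.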

edited Jan 10 '19 at 11:50

answered Jan 10 '19 at 11:45

jpp


The trick with assign works well, but the concatenation goes wrong in this case. It gives a huge matrix with tons of columns and NaN values... – Евгений Матвийчук Jan 10 '19 at 12:44
Update: it worked after I added names for the columns with: new_df.columns = ["Name", "Gender", "Quantity"] Thanks a lot! – Евгений Матвийчук Jan 10 '19 at 13:10
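
The NaN-filled result described in these comments is what happens when `pd.read_csv` is called without `header=None` on files that have no header row: the first data row of each file becomes its column labels, so the frames share no common columns and `pd.concat` pads the gaps with NaN. The small sketch below reproduces that with two in-memory stand-ins for the real files. The sample rows are made up for illustration only, and the column names follow the ones used in the comment.

```python
import io

import pandas as pd

# Two tiny in-memory stand-ins for yob1880.txt and yob1881.txt.
# The rows are made-up sample data, not taken from the real dataset.
samples = {
    1880: "Mary,F,7065\nAnna,F,2604\n",
    1881: "John,M,8769\nEmma,F,2034\n",
}

# Without header=None, the first data row of each file is consumed as the
# header, so each frame gets different column labels and concat pads with NaN.
broken = pd.concat(
    (pd.read_csv(io.StringIO(text)).assign(YEAR=year)
     for year, text in samples.items()),
    ignore_index=True,
)

# With header=None and explicit names, every frame has the same columns
# and the first row of each file is kept as data.
names = ["Name", "Gender", "Quantity"]
fixed = pd.concat(
    (pd.read_csv(io.StringIO(text), header=None, names=names).assign(YEAR=year)
     for year, text in samples.items()),
    ignore_index=True,
)

print(broken.shape, fixed.shape)
```

Note that renaming the columns after reading with the default header still drops the first row of each file, because that row was already consumed as the header. Passing `header=None` together with `names=` at read time avoids losing it.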
