How to read multiple csv files and plot histogram

Question

I asked this question before, but it was apparently unclear, so let me rephrase it. I have four .csv files containing earthquake data for different years: I_earthquake2016.csv, I_earthquake2017.csv, I_earthquake2018.csv, and I_earthquake2019.csv. They all have the same columns; only the number of rows differs. I wrote some code that reads one of the files and plots a histogram showing how many earthquakes happened each month.

Questions:

I don't know how to write code that reads all the files and plots the same histogram for each of them (using a loop).
I don't know how to make a histogram showing the number of earthquakes for each year (2016-2019).

Can anybody please show me how to do it? Thank you.

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data = pd.read_csv('I_earthquake2017.csv')
print(data[:1])

Output (first row):

                               time  latitude  longitude  depth  mag
0  2017-12-30 20:53:24.700000+00:00   29.4481    51.9793   10.0  4.9

data['time']=pd.to_datetime(data['time'])
data['MONTH']=data['time'].dt.month
data['YEAR']=data['time'].dt.year
print(data[:1])

Output (first row):

                               time  latitude  longitude  depth  mag  MONTH  YEAR
0  2017-12-30 20:53:24.700000+00:00   29.4481    51.9793   10.0  4.9     12  2017

plt.hist(x=[data.MONTH],bins=12,alpha=0.5)
plt.show()

Does this answer your question? [Can pandas automatically recognize dates?](https://stackoverflow.com/questions/17465045/can-pandas-automatically-recognize-dates) — Björn, Apr 25 '20 at 08:47
We don't know how the dates are represented in your CSV. Generally speaking, though, you want to work with [datetime objects](https://docs.python.org/3/library/datetime.html). `pd.read_csv()` can [parse dates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) automatically. — Björn, Apr 25 '20 at 08:50
Do you want the histogram to display the average number, or a separate histogram for each of the 4 years? — Björn, Apr 27 '20 at 06:45
@BjörnB Thank you. It works for just one csv file. How can I put the above code into a loop and plot a histogram for each of the 4 years as 2x2 subplots (x-axis = month, y-axis = number of earthquakes)? I would also like one figure with x-axis = year and y-axis = number of earthquakes. — sam_sam, Apr 27 '20 at 08:50
I hope my answer solved your question. If it did, consider accepting it :) — Björn, Apr 27 '20 at 10:21

Björn · Accepted Answer · 2020-04-28T12:26:49.423

EDIT: Wrapped the assignment of csv_list in sorted() so the subplots appear in the right order.
Changed line: csv_list = sorted(list(base_dir.glob("*.csv")))

I simulated your data (for those interested, the simulation code is in the last part of this answer).

Necessary imports for the code

#!/usr/bin/env python3
import calendar
from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd

Answer 1: Read Multiple .csv Files

There is the glob module; however, I prefer pathlib's implementation of glob (both are part of the standard library). Both let you search for a shell-style wildcard pattern (like *.csv; note that this is not a regular expression). See the quote below from the docs:

Glob the given relative pattern in the directory represented by this path, yielding all matching files (of any kind)

The code below gives you a list of pandas DataFrames. The argument parse_dates=['time'] automatically converts the time column to datetime, so you don't need pd.to_datetime() anymore. You will need to adapt base_dir to match the correct directory on your PC.

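# Read in multiple CSV files
base_dir = Path("C:/Test/Earthquake-Data")
# sorted() keeps the files in year order; glob() alone guarantees no order
csv_list = sorted(list(base_dir.glob("*.csv")))
# index_col=0 matches the simulated files, which were saved with an index column;
# drop it if your files have no index column
df_list = [pd.read_csv(file, index_col=0, parse_dates=['time']) for file in csv_list]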
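
As a self-contained illustration of the reading step, here is a minimal sketch that writes two tiny CSV files into a temporary directory and reads them back in sorted order. The helper name read_yearly_csvs, the pattern argument, and the sample rows are assumptions made up for this example; index_col is omitted because these sample files have no index column.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_yearly_csvs(base_dir, pattern="I_earthquake*.csv"):
    """Read every CSV matching `pattern` in `base_dir`, sorted by file name."""
    # glob() yields paths in arbitrary, OS-dependent order, so sort explicitly.
    csv_paths = sorted(Path(base_dir).glob(pattern))
    # parse_dates converts the 'time' column to datetimes while reading.
    return [pd.read_csv(path, parse_dates=["time"]) for path in csv_paths]


# Made-up sample data, written in reverse order to show that sorting matters.
tmp_dir = Path(tempfile.mkdtemp())
for year in (2017, 2016):
    (tmp_dir / f"I_earthquake{year}.csv").write_text(
        "time,latitude,longitude,depth,mag\n"
        f"{year}-03-01 10:00:00+00:00,29.4,51.9,10.0,4.9\n"
        f"{year}-12-30 20:53:24+00:00,30.1,52.3,12.0,5.1\n"
    )

df_list = read_yearly_csvs(tmp_dir)
# Year of the first row in each DataFrame, in the order the files were read.
years = [int(df["time"].dt.year.iloc[0]) for df in df_list]
print(years)
```

Because the file names differ only in the year, sorting the paths alphabetically is enough to put the DataFrames in chronological order.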

Answer 2: Plot Multiple Histograms

You can create a 2 x 2 grid of subplots with plt.subplots(). In the code below, I iterate over the list of dataframes together with the list of axes using zip(df_list,fig.get_axes()) and unpack each resulting (df, ax) tuple into the two variables df and ax. In the loop, I use the vectorized .dt.month accessor on the time column to create the histogram and adjust some of the appearance parameters:

Set the title of each subplot to the year: title=str(df['time'].dt.year[0])
Set the tick labels on the x-axis to the abbreviated month names (stored in list(calendar.month_abbr[1:])). Note that calendar is imported in the first part of my answer (above).
Rotate the x-labels (abbreviated month names) to improve readability

Code:

# Create a 2 x 2 grid; named `axes` so the loop variable `ax` does not shadow it
fig, axes = plt.subplots(2, 2)
for df, ax in zip(df_list, fig.get_axes()):
    # One histogram of month numbers per year, titled with that year
    df['time'].dt.month.plot(kind="hist", ax=ax, bins=12, title=str(df['time'].dt.year[0]))
    ax.set_xticks(range(1, 13))
    ax.set_xticklabels(list(calendar.month_abbr[1:]))
    # Rotate the xticks for increased readability
    for tick in ax.get_xticklabels():
        tick.set_rotation(45)
fig.tight_layout()
plt.show()
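
The question also asks for one figure with the year on the x-axis and the number of earthquakes on the y-axis. One possible way to get there, sketched here under assumptions, is to concatenate the DataFrames, count rows per year, and draw a bar chart. The function names count_per_year and plot_yearly_counts and the tiny stand-in df_list are made up for illustration; with real data you would pass the df_list read from your files.

```python
import matplotlib.pyplot as plt
import pandas as pd


def count_per_year(df_list):
    """Return a Series mapping each year to its number of rows (earthquakes)."""
    combined = pd.concat(df_list, ignore_index=True)
    return combined["time"].dt.year.value_counts().sort_index()


def plot_yearly_counts(counts):
    """Draw one bar per year and return the Axes."""
    fig, ax = plt.subplots()
    # Use string labels so the years are treated as categories, not numbers.
    ax.bar([str(year) for year in counts.index], counts.values)
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of earthquakes")
    fig.tight_layout()
    return ax


# Made-up stand-in for df_list: 3 events in 2016 and 2 events in 2017.
df_list = [
    pd.DataFrame({"time": pd.to_datetime(["2016-01-05", "2016-06-20", "2016-11-02"])}),
    pd.DataFrame({"time": pd.to_datetime(["2017-02-14", "2017-12-30"])}),
]
counts = count_per_year(df_list)
print(counts.to_dict())
ax = plot_yearly_counts(counts)
```

Call plt.show() afterwards to display the figure. A bar chart of counts is used here instead of plt.hist because each year is a single category, so there is nothing to bin.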

Simulate Earthquake Data

#!/usr/bin/env python3
import numpy as np
import pandas as pd
# random_datetimes is a hand-written custom helper (not a public package);
# it creates n random datetimes between two dates
from my_utils.advDateTime import random_datetimes
from pathlib import Path

year_range = range(2016, 2020)
time = [random_datetimes(pd.to_datetime(f"1/1/{year}"), pd.to_datetime(f"1/1/{year + 1}"), n=100)
        for year in year_range]
latitude = [np.random.randint(0, 100, 100) for i in range(4)]
list_dfs = [pd.DataFrame({'latitude': lat, 'time': t}).sort_values("time").reset_index(drop=True)
            for lat, t in zip(latitude, time)]

# Export to CSV (one file per year, saved with the index column)
base_dir = Path("C:/Test/Earthquake-Data")
for df, year in zip(list_dfs, year_range):
    df.to_csv(base_dir / f"I_earthquake{year}.csv")
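
Since random_datetimes is a custom helper that is not publicly available, the simulation cannot be run as posted. The following is a rough, self-contained sketch of how similar data could be generated with only NumPy and pandas. The function random_datetimes_sketch is a hypothetical stand-in written for this example, not the original helper, and the fixed seed is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def random_datetimes_sketch(start, end, n, rng):
    """Hypothetical stand-in: n random timestamps in the range [start, end)."""
    # Draw random second offsets from `start`, then convert them to timestamps.
    span_seconds = int((end - start).total_seconds())
    offsets = rng.integers(0, span_seconds, size=n)
    return start + pd.to_timedelta(offsets, unit="s")


rng = np.random.default_rng(0)  # seeded so the simulated data is reproducible
list_dfs = []
for year in range(2016, 2020):
    start = pd.Timestamp(year=year, month=1, day=1)
    end = pd.Timestamp(year=year + 1, month=1, day=1)
    times = random_datetimes_sketch(start, end, n=100, rng=rng)
    df = pd.DataFrame({"latitude": rng.integers(0, 100, size=100), "time": times})
    # Sort chronologically, as in the simulation above.
    list_dfs.append(df.sort_values("time").reset_index(drop=True))
```

Each DataFrame in list_dfs can then be written with df.to_csv() exactly as in the export step above.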

I got an error at df['time'].dt.month.plot(); after removing index_col=0 it works. I get nice subplots, but I don't know how to put them in ascending order of years. When I print(df), the order is 2018, 2019, 2016, 2017, so the subplots appear in the same order. For the last part, I got this error: NameError: name 'random_datetimes' is not defined. I am using a Jupyter notebook installed in an Anaconda environment. — sam_sam, Apr 27 '20 at 23:36
@sam_sam Sure, the last part was just to *simulate your data*. The random_datetimes import cannot work for you because it is a hand-written custom function that creates random date ranges. — Björn, Apr 28 '20 at 06:42
I can't replicate your error. The plots are in the correct order. When you print `df_list` or `csv_list`, in which order do the dataframes appear? The **output of csv_list** for me: `[WindowsPath('C:/Test/Earthquake-Data/I_earthquake2016.csv'), WindowsPath('C:/Test/Earthquake-Data/I_earthquake2017.csv'), WindowsPath('C:/Test/Earthquake-Data/I_earthquake2018.csv'), WindowsPath('C:/Test/Earthquake-Data/I_earthquake2019.csv')]` — Björn, Apr 28 '20 at 06:50
[PosixPath('/Users/…/python/i_eartquake/I_earthquake2019.csv'), PosixPath('/Users/…/python/i_eartquake/I_earthquake2018.csv'), PosixPath('/Users/…/python/i_eartquake/I_earthquake2016.csv'), PosixPath('/Users/…/python/i_eartquake/I_earthquake2017.csv')] — sam_sam, Apr 28 '20 at 12:05
Updated my answer; just adding a `sorted` should probably fix the order. — Björn, Apr 28 '20 at 12:25
Hi, would you accept the answer then if it solved your issue? — Björn, Apr 29 '20 at 05:09
