
I have scraped some content from a website and saved the data into several different CSV files.

For example,

csv1:-

row number    time              price
1              2018/01/01        12
2              2018/01/02        15

csv2:-

row number    time              address
1              2018/01/01        MI
2              2018/01/02        AR

Now, how can I merge the two CSV files into one CSV file? Below is the format of the new CSV.

row number    time              price         address
1              2018/01/01        12             MI
2              2018/01/02        15             AR

Can someone help me?

This question has confused me for several days.

Thanks a lot!


hygull
Yao Qiang

4 Answers


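You may use pandas df.append(), although note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() is its replacement. You may reference this answer.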

If these CSVs have different columns, read each one into its own pandas DataFrame, and then create a new DataFrame that references the columns of those individual DataFrames.
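As a rough sketch of the difference between stacking rows and building a new DataFrame from columns: the snippet below reads the two example tables from in-memory strings (an assumption made so it runs without files on disk) and compares both approaches. Picking columns by position like this only works if both files contain the same rows in the same order, which is also an assumption.

```python
import io

import pandas as pd

# In-memory stand-ins for csv1 and csv2 from the question (assumed comma-separated).
csv1_text = "row number,time,price\n1,2018/01/01,12\n2,2018/01/02,15\n"
csv2_text = "row number,time,address\n1,2018/01/01,MI\n2,2018/01/02,AR\n"

df1 = pd.read_csv(io.StringIO(csv1_text))
df2 = pd.read_csv(io.StringIO(csv2_text))

# Row-wise stacking keeps every input row, so two 2-row files give 4 rows,
# with NaN wherever a column does not exist in the source file.
stacked = pd.concat([df1, df2], ignore_index=True)

# Building a new frame from individual columns keeps 2 rows, but it relies on
# both frames listing the same rows in the same order.
combined = pd.DataFrame(
    {
        "row number": df1["row number"],
        "time": df1["time"],
        "price": df1["price"],
        "address": df2["address"],
    }
)

print(stacked)
print(combined)
```

If the row order cannot be guaranteed, a key-based merge on the shared columns is likely the safer choice.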

Random Nerd
  • Actually, I have done it like this, but there are some problems. The new CSV file has all the data, but values for the same time end up in different rows according to their columns. For example, the new CSV file should have 2 rows, but after append() it has 4 rows. – Yao Qiang Nov 21 '18 at 14:53

For your case, you can also use the pd.merge function of pandas:

In [488]: df1 = pd.read_csv('/home/mayankp/Documents/Personal/stackoverflow/csv1.csv')

In [498]: df1
Out[498]: 
   row_number        time  price
0           1  2018/01/01     12
1           2  2018/01/02     15

In [490]: df2 = pd.read_csv('/home/mayankp/Documents/Personal/stackoverflow/csv2.csv')

In [499]: df2
Out[499]: 
   row_number        time address
0           1  2018/01/01      MI
1           2  2018/01/02      AR

In [500]: pd.merge(df1,df2, on=['row_number','time'])
Out[500]: 
   row_number        time  price address
0           1  2018/01/01     12      MI
1           2  2018/01/02     15      AR
Mayank Porwal
  • Very helpful! Thank you very much! – Yao Qiang Nov 21 '18 at 23:54
  • I am sorry, there is another new problem. In my dataset, not all columns have the same number of rows. For example, price starts from 2018/01/01, but address starts from 2017/11/01. In this situation, the new CSV file only starts from 2018/01/01 and drops the address data from 2017/11/01 to 2017/12/31. How can I deal with this problem? – Yao Qiang Nov 22 '18 at 00:06
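One possible way to handle the follow-up case in the comment above is an outer merge, which keeps dates that appear in only one of the files and fills the missing values with NaN. This is only a sketch under assumptions: the sample rows (including the "TX" address) are made up for illustration, and the merge uses only the time column because row numbers would no longer line up between files of different lengths.

```python
import pandas as pd

# Hypothetical sample data: address starts one day earlier than price.
price = pd.DataFrame(
    {"time": ["2018/01/01", "2018/01/02"], "price": [12, 15]}
)
address = pd.DataFrame(
    {
        "time": ["2017/12/31", "2018/01/01", "2018/01/02"],
        "address": ["TX", "MI", "AR"],
    }
)

# how="outer" keeps every date found in either frame; the default
# how="inner" would keep only the dates present in both.
merged = (
    pd.merge(price, address, on="time", how="outer")
    .sort_values("time")
    .reset_index(drop=True)
)

print(merged)
```

Because the price column now contains a missing value, pandas stores it as floating point; whether to fill or drop those gaps depends on what the output file is used for.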

Try the following:

import pandas as pd

csv1 = pd.read_csv("file1.csv")
csv2 = pd.read_csv("file2.csv")

csv_out = csv1.merge(csv2, on=['row number','time'])
csv_out.to_csv("file_out.csv", index=False)

Hope it helps.
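Since the question mentions several CSV files, the same merge can be repeated over a list of DataFrames. The following is a minimal sketch, assuming every file shares the row number and time columns; the third table with a volume column is hypothetical and only there to show more than two inputs, and the files are simulated with in-memory strings.

```python
import io
from functools import reduce

import pandas as pd

# Hypothetical in-memory stand-ins for the CSV files on disk.
csv_texts = [
    "row number,time,price\n1,2018/01/01,12\n2,2018/01/02,15\n",
    "row number,time,address\n1,2018/01/01,MI\n2,2018/01/02,AR\n",
    "row number,time,volume\n1,2018/01/01,100\n2,2018/01/02,250\n",
]
frames = [pd.read_csv(io.StringIO(text)) for text in csv_texts]

# Fold the list into one frame, merging pairwise on the shared key columns.
merged_all = reduce(
    lambda left, right: left.merge(right, on=["row number", "time"]),
    frames,
)

# Without a path, to_csv returns the CSV text; pass a file name to write a file.
csv_output = merged_all.to_csv(index=False)
print(csv_output)
```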

Taher A. Ghaleb
  • Very helpful! Thank you very much! – Yao Qiang Nov 21 '18 at 23:53
  • I am sorry, there is another new problem. In my dataset, not all columns have the same number of rows. For example, price starts from 2018/01/01, but address starts from 2017/11/01. In this situation, the new CSV file only starts from 2018/01/01 and drops the address data from 2017/11/01 to 2017/12/31. How can I deal with this problem? – Yao Qiang Nov 22 '18 at 00:06
  • I see. Can you please update your question to include these cases? – Taher A. Ghaleb Nov 22 '18 at 00:13
  • Great. You now just need to mark whichever answer best fits your needs as **Accepted**. Thanks. – Taher A. Ghaleb Nov 22 '18 at 00:20

I know you have CSV files, but here I am creating the DataFrames manually, with the same data you showed in the question, to illustrate the approach.

Below is the code that you're looking for.

>>> import pandas as pd
>>>
>>> dri = pd.date_range("2018/01/01", periods=2, freq="D")
>>>
>>> df = pd.DataFrame({"time": dri, "price": [12, 15]}, index = [1, 2])
>>> df
        time  price
1 2018-01-01     12
2 2018-01-02     15
>>>
>>> df2 = pd.DataFrame({"time": dri, "address": ["MI", "AR"]}, index=[1, 2])
>>> df2
        time address
1 2018-01-01      MI
2 2018-01-02      AR
>>>
>>> # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
...
>>>
>>> df.merge(df2, on = "time", how = "inner", left_index = True)
        time  price address
1 2018-01-01     12      MI
2 2018-01-02     15      AR
>>>

By default, pandas does not show a label for the index on the left of a DataFrame. If you want the index to carry a label as in your example (row number, in your case), have a look at the statements below, executed in the Python interactive terminal.

>>> df.index.name = "row number"
>>> df
                 time  price
row number
1          2018-01-01     12
2          2018-01-02     15
>>>
>>> df2.index.name = "row number"
>>>
>>> df2
                 time address
row number
1          2018-01-01      MI
2          2018-01-02      AR
>>>
>>> df.merge(df2, on = "time", how = "inner", left_index = True)
                 time  price address
row number
1          2018-01-01     12      MI
2          2018-01-02     15      AR
>>>
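Depending on the installed pandas version, passing on= together with left_index=True may be rejected with an error; that is an assumption about newer releases, not something shown in the session above. A sketch that should behave the same way regardless is to turn the labelled index into a regular column, merge on both keys, and then restore the index:

```python
import pandas as pd

# Same data as in the interactive session above.
dates = pd.date_range("2018/01/01", periods=2, freq="D")
df = pd.DataFrame({"time": dates, "price": [12, 15]}, index=[1, 2])
df2 = pd.DataFrame({"time": dates, "address": ["MI", "AR"]}, index=[1, 2])
df.index.name = "row number"
df2.index.name = "row number"

# Move the index into a column, merge on both keys, then restore the index.
merged = (
    df.reset_index()
    .merge(df2.reset_index(), on=["row number", "time"], how="inner")
    .set_index("row number")
)

print(merged)
```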
Community
hygull
  • Very helpful! Thank you very much! – Yao Qiang Nov 21 '18 at 23:54
  • I am sorry, there is another new problem. In my dataset, not all columns have the same number of rows. For example, price starts from 2018/01/01, but address starts from 2017/11/01. In this situation, the new CSV file only starts from 2018/01/01 and drops the address data from 2017/11/01 to 2017/12/31. How can I deal with this problem? – Yao Qiang Nov 22 '18 at 00:06
  • Okay @Yao, please provide a sample of the output format you expect so that I can understand your intention better. You can create a gist on GitHub and send the link with the input and output formats. That will help me to help you, or if you wish you can add a short description to this question as well. Thank you for replying. – hygull Nov 22 '18 at 04:55