Python & Pandas: missing rows when combining csv's

Question

New to Python and Pandas, all insights appreciated on this.

I'm working on a script that takes two csvs and combines them. However, the final output fails to write some rows, or those rows get overwritten - it's unclear to me what's happening.

The first csv, posts.csv, is structured like this, with 21 rows (including the header row):

user_id,    text
6354,   text1
5457,   text2
5109,   text3

The second csv, replies.csv is similarly structured, with 38 rows (including header). The user_id field in the first and second csv's refers to the same users:

user_id,    text
5457,   texta
5109,   textb
5350,   textc

Here's my code for combining the two csv's:

import pandas as pd

# Read both files and stack them into one DataFrame with a fresh index.
df = pd.concat(
    map(pd.read_csv, ['posts.csv', 'replies.csv']), ignore_index=True)

df.to_csv("Control2.csv", index=False)

My output file, Control2.csv, should contain 58 rows (57 rows + 1 header row). However, only 43 rows are written. It appears there are missing rows from both csv's. Any idea what may be happening here? All assistance appreciated.

Does this answer your question? [Import multiple csv files into pandas and concatenate into one DataFrame](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe) — Chris, Aug 20 '21 at 21:10
Many thanks for this link. However, the same issue occurred with the script on that page - exact same rows are missing. — Daniel Hutchinson, Aug 20 '21 at 21:13
This question would be easier to answer if you would provide plain text for both csvs. — Henry Ecker, Aug 20 '21 at 21:13
This is a guess only, but is your user_id column being read in as the index? because if so, and if you have matching indexes in both files that could maybe be the issue? — Christopher J. Joubert, Aug 20 '21 at 21:22
I can't reproduce. Both provided csvs read in correctly and the appended results produce the expected output file in pandas 1.3.1. — Henry Ecker, Aug 20 '21 at 21:33
@DanielHutchinson I don't think pandas is a must in this case. See my answer below. — balderman, Aug 20 '21 at 21:35

score 1 · Answer 1 · answered Aug 20 '21 at 21:20

See below (no external lib is needed).
The idea is to copy the first file in full (header plus data) and, for every other file, copy only the data rows.

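file_names = ['1.csv', '2.csv']
with open('result.csv', 'w') as outfile:
    for idx, file_name in enumerate(file_names):
        with open(file_name) as infile:
            lines = infile.readlines()
        # Keep the header from the first file only.
        if idx > 0:
            lines = lines[1:]
        # Make sure the last row ends with a newline, so the next
        # file's first row is not glued onto it.
        if lines and not lines[-1].endswith('\n'):
            lines[-1] += '\n'
        outfile.writelines(lines)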
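
To check whether rows are really missing, it may help to count parsed records rather than physical lines, since a quoted text field can span several lines. That this is what happens in your files is only a guess; nothing in the question confirms it. The sketch below uses only the standard csv module. The helper names count_records and check_combined are made up for illustration, and it assumes every input file has exactly one header row.

```python
import csv


def count_records(path):
    """Count non-empty CSV records (header included), not physical lines."""
    # newline='' lets the csv module handle line breaks inside quoted fields.
    with open(path, newline='') as f:
        return sum(1 for row in csv.reader(f) if row)


def check_combined(input_paths, output_path):
    """Return (expected, actual) record counts for a combined file.

    Assumes each input has exactly one header row and that the
    combined file keeps a single header.
    """
    # Each input contributes its data rows; one header is added back.
    expected = sum(count_records(p) - 1 for p in input_paths) + 1
    actual = count_records(output_path)
    return expected, actual
```

Calling check_combined(['posts.csv', 'replies.csv'], 'Control2.csv') returns the two counts side by side. If they match, the rows are all there and the difference comes from how the file is being viewed or counted; if they differ, comparing count_records for each input against what you expect should show which file loses rows.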
