
I'm new to Python and pandas, so any insights are appreciated.

I'm working on a script that combines two CSV files. However, the final output is missing some rows, or those rows are being overwritten; it's unclear to me which.

The first CSV, posts.csv, is structured like this and has 21 rows (including the header row):

user_id,    text
6354,   text1
5457,   text2
5109,   text3

The second CSV, replies.csv, is structured the same way and has 38 rows (including the header). The user_id field in both files refers to the same set of users:

user_id,    text
5457,   texta
5109,   textb
5350,   textc

Here's my code for combining the two CSV files:

import pandas as pd

# Read both files and stack their rows under a single header.
df = pd.concat(
    map(pd.read_csv, ['posts.csv', 'replies.csv']), ignore_index=True)

df.to_csv("Control2.csv", index=False)

My output file, Control2.csv, should contain 58 rows (57 data rows + 1 header row). However, only 43 rows are written, and rows appear to be missing from both CSV files. Any idea what may be happening here? All assistance appreciated.

  • Does this answer your question? [Import multiple csv files into pandas and concatenate into one DataFrame](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe) – Chris Aug 20 '21 at 21:10
  • Many thanks for this link. However, the same issue occurred with the script on that page - exact same rows are missing. – Daniel Hutchinson Aug 20 '21 at 21:13
  • This question would be easier to answer if you would provide plain text for both csvs. – Henry Ecker Aug 20 '21 at 21:13
  • Edits made for plain text - thanks. – Daniel Hutchinson Aug 20 '21 at 21:16
  • This is only a guess, but is your user_id column being read in as the index? If so, and if you have matching index values in both files, that could be the issue. – Christopher J. Joubert Aug 20 '21 at 21:22
  • I can't reproduce. Both provided csvs read in correctly and the appended results produce the expected output file in pandas 1.3.1. – Henry Ecker Aug 20 '21 at 21:33
  • @DanielHutchinson I don't think pandas is a must in this case. See my answer below. – balderman Aug 20 '21 at 21:35

1 Answer

See below (no external library is needed).
The idea is to read the header and the data from the first file. For the other files, we read only the data and skip the header row.

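file_names = ['1.csv', '2.csv']
with open('result.csv', 'w', newline='') as outfile:
    for idx, file_name in enumerate(file_names):
        with open(file_name, newline='') as infile:
            lines = infile.readlines()
        # Keep the header row only from the first file.
        if idx > 0:
            lines = lines[1:]
        # Make sure the last row ends with a newline so the next file's
        # first row is not glued onto it.
        if lines and not lines[-1].endswith('\n'):
            lines[-1] += '\n'
        outfile.writelines(lines)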
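
If you want to check whether rows are really being lost, one possible diagnostic is to compare the number of physical lines in each file with the number of records a CSV parser sees. This is only a sketch and has not been confirmed against the files in the question. The assumption behind it is that a quoted text field containing a line break spans several lines but counts as one record, and that blank lines are not records at all, so a line count taken in a text editor can be higher than the parsed row count. The helper name count_lines_and_records is hypothetical.

```python
import csv


def count_lines_and_records(path):
    """Return (physical_lines, csv_records) for a CSV file, header included."""
    # Count raw lines exactly as a text editor would show them.
    with open(path, newline='') as f:
        physical_lines = sum(1 for _ in f)
    # Count parsed records; a quoted field with embedded newlines is a
    # single record, and blank lines (empty rows) are ignored.
    with open(path, newline='') as f:
        csv_records = sum(1 for row in csv.reader(f) if row)
    return physical_lines, csv_records
```

Calling count_lines_and_records('posts.csv') and count_lines_and_records('replies.csv') and comparing the two numbers in each result would show whether the gap comes from how the files are parsed rather than from the concatenation step.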
balderman