0

Let's say I have a csv file that is such that:

Dog   
Cat
Bird   

is a common column with a txt file I have that has two columns of interest:

Cat            8.9
Bird           12.2
Dog            2.1

One column being the identifiers (species name), the other being let's say, speed in mph.

I want to parse through the rows of the csv file, and lookup the species speed from my txt file, and then add it as an extra column to the csv file. Ie, I want to join these two files based on the common species key, and I specifically JUST want to append the speed from the txt (and not other columns that may be in that file) to the csv file. The csv file also has superfluous columns that I would like to keep; but I don't want any of the superfluous columns in the txt file (let's say it had heights, weights, and so on. I don't want to add those to my csv file).

Let's say that this file is very large; I have 1,000,000 species in both files.

What would be the shortest and cleanest way to do this? Should I write some type of a nested loop, should I use a dictionary, or would Pandas be an effective method?

Note: let's say I also want to compute the mean speed of all the species; ie I want to take the mean of the speed column. Would numpy be the most effective method for this?

Additional note: All the species names are in common, but the order in the csv and txt files are different (as indicated in my example). How would I correct for this?

Additional note 2: This is a totally contrived example since I've been learning about dictionaries and input/output files, and reviewing loops that I wrote in the past.

Note 3: The txt file should be tab seperated, but the csv file is (obviously) comma separated.

martineau
  • 119,623
  • 25
  • 170
  • 301
glico
  • 3
  • 4
  • Use [pandas](https://pandas.pydata.org/) – Dani Mesejo Nov 01 '20 at 16:03
  • Out of instinct I would say pandas, because there are a lot of functions that are using c, so they are doing it faster – Charalamm Nov 01 '20 at 16:04
  • CSV files may be delimited by (single) character other than a comma — think Character Separated Values — so tab characters, for example could be used. This means you could use the `csv` module to read what you're calling the "text" file. (CSV file are text files, too.) – martineau Nov 01 '20 at 16:18
  • Don't use nested loops, everything else is ok. – Wups Nov 01 '20 at 16:45

3 Answers3

2

You could do everything needed with the built-in csv module. As I mentioned in a comment, it can be used to read the text file since it's tab-delimited (as well as read the csv file and write an updated version of it).

You seem to indicate there are other fields besides the "animal" and "speed" in both files, but the code below assumes they only have one or both of them.

import csv

csv_filename = 'the_csv_file.csv'
updated_csv_filename = 'the_csv_file2.csv'
txt_filename = 'the_text_file.txt'

# Create a lookup table from text file.
with open(txt_filename, 'r', newline='') as txt_file:
    # Use csv module to read text file.
    reader = csv.DictReader(txt_file, ('animal', 'speed'), delimiter='\t')
    lookup = {row['animal']: row['speed'] for row in reader}

# Read csv file and create an updated version of it.
with open(csv_filename, 'r', newline='') as csv_file, \
     open(updated_csv_filename, 'w', newline='') as updated_csv_file:

    reader = csv.reader(csv_file)
    writer = csv.writer(updated_csv_file)
    for row in reader:
        # Insert looked-up value (speed) into the row following the first column
        # (animal) and copy any columns following that.
        row.insert(1, lookup[row[0]])  # Insert looked-up speed into column 2.
        writer.writerow(row)

Given the two input files in your question, here's the contents of the updated csv file (assuming there were no additional columns):

Dog,2.1
Cat,8.9
Bird,12.2
martineau
  • 119,623
  • 25
  • 170
  • 301
  • So to clarify this on dictionaries; if, let's say the txt file calls the column with 1 million species "Animal" but the CSV file calls it "Species", how would this change the code above? For context, I took an intro-level comp sci course but there wasn't enough time to cover dictionaries, so I'm just trying to figure out the logic behind this code. I recognize the for loops and how you're writing it out, but not exactly the rest. Also since this assumes there are no columns, if I want to keep the cols from the CSV file but add the new one to it, do I just do species/speed/all other cols? – glico Nov 06 '20 at 02:39
1

This is probably most easily achieved with pandas DataFrames.

You can load both your CSV and text file using the read_csv function (just adjust the separator for the text file to tab) and use the join function to join the two DataFrames on the columns you want to match, something like:

column = pd.read_csv('data.csv')
data = pd.read_csv('data.txt', sep='\t')

joined = column.join(data, on='species')

result = joined[['species', 'speed', 'other column you want to keep']]

If you want to conduct more in depth analysis of your data or your files are too large for memory, you may want to look into importing your data into a dedicated database management system like PostgreSQL.

EDIT: If your files don't contain column names, you can load them with custom names using pd.read_csv(file_path, header=None, names=['name1','name2']) as described here. Also, columns can be renamed after loading using dataframe.rename(columns = {'oldname':'newname'}, inplace = True) as seen here.

Spiess
  • 9
  • 3
  • What about the case if the header was named differently? Ie; let's say the csv file says "Animal" but the txt file says "species" – glico Nov 02 '20 at 15:26
  • @Rina columns can be renamed using `dataframe.rename(columns = {'oldname':'newname'}, inplace = True)` as seen [here](https://www.geeksforgeeks.org/how-to-rename-columns-in-pandas-dataframe/). – Spiess Nov 03 '20 at 11:05
0

You can just use the merge() method of pandas like this:

import pandas as pd

df_csv = pd.read_csv('csv.csv', header=None)
df_txt = pd.read_fwf('txt.txt', header=None)

result = pd.merge(df_txt,df_csv)

print(result)

Gives the following output:

      0     1
0   Cat   8.9
1  Bird  12.2
2   Dog   2.1
Alexander Riedel
  • 1,329
  • 1
  • 7
  • 14