
I need quick help with reading CSV files using Python and storing the data in a 'data-type' file, so that after storing all the data in different files I can use it to create graphs.

I have searched for this, but in every case I found, the data had headers. My data does not have a header row. The fields are tab separated, and I need to store only specific columns of the data. Ex:

12345601 2345678@abcdef 1 2 365 places

In this case, as an example, I would want to store only "2345678@abcdef" and "365" in the new python file in order to use them in the future to create a graph.

Also, I have more than one csv file in a folder and I need to do this for each of them. The sources I found did not cover that and only referred to:

# open csv file
with open(csv_file, 'rb') as csvfile:

Could anyone refer me to an already-answered question or help me out with this?

r_e

2 Answers


. . . and storing it in a PY file to use the data to graph after storing all the data in different files . . .

. . . I would want to store only "2345678@abcdef" and "365" in the new python file . . .

Are you sure that you want to store the data in a Python file? Python files are supposed to hold Python code and be executable by the Python interpreter. It would be a better idea to store your data in a data-type file (say, preprocessed_data.csv).

To get a list of files matching a pattern, you can use Python's built-in glob module.
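For instance, here is a minimal sketch of how a glob pattern picks out only the matching files (the scratch directory and file names here are purely illustrative):

```python
import glob
import os
import tempfile

# make a scratch directory with one csv file and one non-csv file in it
scratch = tempfile.mkdtemp()
open(os.path.join(scratch, 'a.csv'), 'w').close()
open(os.path.join(scratch, 'notes.txt'), 'w').close()

# glob.glob returns only the paths that match the pattern
matches = glob.glob(os.path.join(scratch, '*.csv'))
print(matches)  # only a.csv matches, not notes.txt
```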

Here's an example of how you could read multiple csv files in a directory and extract the desired columns from each one:

import glob

# indices of columns you want to preserve
desired_columns = [1, 4]
# change this to the directory that holds your data files
csv_directory = '/path/to/csv/files/*.csv'

# iterate over files holding data
extracted_data = []
for file_name in glob.glob(csv_directory):
    with open(file_name, 'r') as data_file:
        for line in data_file:
            # splits the line by whitespace (tabs included)
            tokens = line.split()
            # skip any blank lines
            if not tokens:
                continue
            # only grab the columns we care about
            desired_data = [tokens[i] for i in desired_columns]
            extracted_data.append(desired_data)

It would be easy to write the extracted data to a new file. The following example shows how you might save the data to a csv file.

output_string = ''
for row in extracted_data:
    output_string += ','.join(row) + '\n'

with open('./preprocessed_data.csv', 'w') as csv_file:
    csv_file.write(output_string)

Edit:

If you don't want to combine all the csv files, here's a version that can process one at a time:

def process_file(input_path, output_path, selected_columns):
    extracted_data = []
    with open(input_path, 'r') as in_file:
        for line in in_file:
            tokens = line.split()
            # skip any blank lines
            if not tokens:
                continue
            extracted_data.append([tokens[i] for i in selected_columns])

    output_string = ''
    for row in extracted_data:
        output_string += ','.join(row) + '\n'

    with open(output_path, 'w') as out_file:
        out_file.write(output_string)

# whenever you need to process a file:
process_file(
    '/path/to/input.csv', 
    '/path/to/processed/output.csv',
    [1, 4])

# if you want to process every file in a directory:
target_directory = '/path/to/my/files/*.csv'
for file in glob.glob(target_directory):
    process_file(file, file + '.out', [1, 4])

Edit 2:

The following example will process every file in a directory and write the results to a similarly-named output file in another directory:

import os
import glob

input_directory = '/path/to/my/files/*.csv'
output_directory = '/path/to/output'
for file in glob.glob(input_directory):
    file_name = os.path.basename(file) + '.out'
    out_file = os.path.join(output_directory, file_name)
    process_file(file, out_file, [1, 4])
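Note that the open call for the output file will fail if the output directory does not exist yet. It can be created up front with os.makedirs; a small sketch (the directory location here is just a stand-in for '/path/to/output'):

```python
import os
import tempfile

# a scratch location standing in for the real output directory
output_directory = os.path.join(tempfile.gettempdir(), 'csv_output_example')

# create the directory (and any missing parents);
# exist_ok=True makes this a no-op if it already exists
os.makedirs(output_directory, exist_ok=True)
```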

If you want to add headers to the output, then process_file could be modified like this:

def process_file(input_path, output_path, selected_columns, column_headers=()):
    extracted_data = []
    with open(input_path, 'r') as in_file:
        for line in in_file:
            tokens = line.split()
            # skip any blank lines
            if not tokens:
                continue
            extracted_data.append([tokens[i] for i in selected_columns])

    # only emit a header row if headers were given
    output_string = ','.join(column_headers) + '\n' if column_headers else ''
    for row in extracted_data:
        output_string += ','.join(row) + '\n'

    with open(output_path, 'w') as out_file:
        out_file.write(output_string)
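As a quick self-contained check (the header names 'email' and 'days' are made up for illustration, and the function body is repeated so the snippet runs on its own):

```python
import os
import tempfile

def process_file(input_path, output_path, selected_columns, column_headers=()):
    extracted_data = []
    with open(input_path, 'r') as in_file:
        for line in in_file:
            tokens = line.split()
            if not tokens:
                continue
            extracted_data.append([tokens[i] for i in selected_columns])

    # only emit a header row if headers were given
    output_string = ','.join(column_headers) + '\n' if column_headers else ''
    for row in extracted_data:
        output_string += ','.join(row) + '\n'

    with open(output_path, 'w') as out_file:
        out_file.write(output_string)

# build a one-line tab-separated sample like the one in the question
scratch = tempfile.mkdtemp()
in_path = os.path.join(scratch, 'sample.csv')
out_path = os.path.join(scratch, 'sample.out')
with open(in_path, 'w') as f:
    f.write('12345601\t2345678@abcdef\t1\t2\t365\tplaces\n')

process_file(in_path, out_path, [1, 4], ['email', 'days'])

with open(out_path) as f:
    print(f.read())
# email,days
# 2345678@abcdef,365
```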
zachdj
  • You are right, I did not mean to say I want to store them in a Python file. I want them in a file where I can use them later on to create graphs. For the code you have, does it combine all the files together? I do not need to combine all the csv files. I just want to know how to specify which CSV file I am reading and importing into a new data file. I am new to this part of Python programming, which is why I am having a little trouble with my explanation – r_e Jun 18 '19 at 13:23
  • Ah, okay that makes sense! The solution I posted will indeed combine all the files. I'll edit my answer with a version that operates on a single file – zachdj Jun 18 '19 at 13:41
  • Thank you so much for the update. I have a couple of questions. 1) Where do I need to mention that the delimiter is TAB? 2) Do I need to create output.csv prior to running the code, or will the code create the file? Thanks again for the help – r_e Jun 18 '19 at 14:08
  • 1. By default, Python's `str.split` method will split a string by whitespace, so there's no need to explicitly mention that the delimiter is tab. 2. The output file will automatically be created, but you'll need to create the output directory in advance. (This can also be done programmatically with `os.makedirs`.) Hope that helps! – zachdj Jun 18 '19 at 15:10
  • Hello! Yes, it helped a lot! It worked pretty well. Thanks for all the help. I have two more questions if you would not mind answering. 1) How do I send the outputs to the (already created) output directory? Where do I specify that in the code? 2) Can I add headers to the new printed columns? If yes, how could I do that? – r_e Jun 18 '19 at 15:16
  • The problem is that you've hard-coded the output file with `output_dir + 'practice.out'`. The first csv file gets processed and writes its rows to `practice.out`. Then the second one will get processed, and _overwrite_ the `practice.out` file, leaving it with only 5 rows. I'll add another edit with some help. – zachdj Jun 18 '19 at 15:33
  • Thank you very much for the new edit and help! My last question for this thread: when I save the outputs, the output file's name is saved as `filename.csv.out`. Is there a way to get rid of the `.csv` part in the name? Thanks again. – r_e Jun 18 '19 at 16:49
  • Also, if you do not mind, could you please explain how does process_file work and add it as a note to your answer? @zachdj – r_e Jun 20 '19 at 12:18

Here's another approach using a namedtuple that will help extract selected fields from a csv file and then let you write them out to a new csv file.

from collections import namedtuple    
import csv

# Setup named tuple to receive csv data
# p1 to p6 are arbitrary field names associated with the six csv columns
SomeData = namedtuple('SomeData', 'p1, p2, p3, p4, p5, p6')

# Read data from the csv file and create a generator object to hold a reference to the data
# We use a generator object rather than a list to reduce the amount of memory our program will use
# The captured data will only have data from the 2nd & 5th column from the csv file
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata.csv", "r"))))

# Write the data to a new csv file
with open("newdata.csv","w", newline='') as csvfile:
    cvswriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    # Use the generator created earlier to access the filtered data and write it out to a new csv file
    for d in datagen:
        cvswriter.writerow(d)
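One caveat with the snippet above: the file handle created by open() inside the generator expression is never explicitly closed. A variant with the same behavior that keeps both files inside with blocks could look like this (the sample data is recreated in a scratch directory so the snippet runs on its own):

```python
from collections import namedtuple
import csv
import os
import tempfile

# recreate the sample input from the answer in a scratch directory
scratch = tempfile.mkdtemp()
in_path = os.path.join(scratch, 'mydata.csv')
out_path = os.path.join(scratch, 'newdata.csv')
with open(in_path, 'w') as f:
    f.write('12345601,2345678@abcdef,1,2,365,places\n')
    f.write('4567,876@def,0,5,200,noplaces\n')

SomeData = namedtuple('SomeData', 'p1, p2, p3, p4, p5, p6')

# both files are closed automatically when the with block exits
with open(in_path, 'r', newline='') as infile, \
        open(out_path, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for d in map(SomeData._make, csv.reader(infile)):
        writer.writerow((d.p2, d.p5))
```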

Original Data in "mydata.csv":

12345601,2345678@abcdef,1,2,365,places  
4567,876@def,0,5,200,noplaces

Output Data in "newdata.csv":

2345678@abcdef,365  
876@def,200

EDIT 1: For tab delimited data make the following changes to the code:
change
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata.csv", "r"))))
to
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata2.csv", "r"), delimiter='\t', quotechar='"')))
and
cvswriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
to
cvswriter = csv.writer(csvfile, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)

John Moore