
I'm very new to Python and data science in general, but I would love feedback on how to accomplish this, along with any recommendations. I have written a Python script using pandas to reformat a .csv file into the correct import format for an instrument (a Bio-Rad CFX384, for the biologists). I would like to make this script work for a broad range of .csv files instead of just one at a time.

The goal is to save a non-import-formatted file into a folder, have this script reformat any newly added .csv files every 3-5 minutes, and then, after writing each file, move the newly formatted file into a nested folder. How do I complete this part?

I have all the formatting done; all I need to do is generalize the script to work for all the .csv files in a folder and move them into the new location. I imagine I will probably use `import os`.

Here is the code I have so far:

import pandas as pd

# The instrument export has 6 rows of metadata before the real header
mydata = pd.read_csv("Example_Export_File_2.csv", header=6)

# Drop the columns the import format doesn't use
mydata.drop(["Position", "ID_1", "Name"], axis=1, inplace=True)

# Constant values for every well on this plate
mydata["*Target Name"] = "N1"
mydata["*Biological Group"] = "Respiratory"

mydata.columns = ['*Sample Name', '*Target Name', '*Biological Group']

# Plate rows: each letter A-P repeated 24 times (384-well layout)
new_column = pd.DataFrame({'Row': [r for r in 'ABCDEFGHIJKLMNOP' for _ in range(24)]})
mydata = mydata.merge(new_column, left_index=True, right_index=True)

# Plate columns: '1' through '24', repeated for each of the 16 rows
new_column = pd.DataFrame({'Column': [str(c) for _ in range(16) for c in range(1, 25)]})
mydata = mydata.merge(new_column, left_index=True, right_index=True)
mydata = mydata[['Row', 'Column', '*Target Name', '*Sample Name', '*Biological Group']]

mydata.to_csv("Example_Export_File_2.csv", index=False)

print(mydata)

Much thanks in advance.

Ted Corl
  • depends on what all the other csvs look like. sounds like you need a way of finding the right index in place of the hardcoded column/header index. – willwrighteng Oct 14 '20 at 03:01
  • All of the csv files will look exactly the same, the only thing that will differ is their file name and their contents in two of the columns. Here is a link so you can see the files that will be located in the folder in which the script will act on. https://gofile.io/d/vkrGSi – Ted Corl Oct 14 '20 at 03:08
  • okay, so your question is how do you do this same transformation across a bunch of files in a directory? edit: I see now, a new file shows up every few minutes – willwrighteng Oct 14 '20 at 03:11

1 Answer


There are a few ways to address your question. I'll say outright that there are ways to schedule a script to run, but they vary in how easy they are to implement and how reliable they are. I'll find a link that goes into detail (scheduling a Python script with a bash script).

For my response, I'll just say that doing this type of thing in batches, manually, is much easier.
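That said, if you do want the script to run on its own every few minutes, a plain polling loop using only the standard library is one option. This is just a sketch: `process_new_files` is a hypothetical stand-in for your reformatting step, and the `iterations` parameter exists only so the loop can be stopped (in production you would let it run forever with the default).

```python
import time

def watch(process_new_files, interval_seconds=180, iterations=None):
    """Call process_new_files() every interval_seconds.

    iterations=None loops forever; pass a number to stop after that
    many passes (handy for testing).
    """
    count = 0
    while iterations is None or count < iterations:
        process_new_files()  # your reformat-and-move step goes here
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)
```

A cron job (or Task Scheduler on Windows) invoking the script every 3-5 minutes is the more robust alternative, since the polling loop dies if the process does.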

If your Python script is in the same directory, then os.listdir() will give you a list of all the files. I usually throw some kind of list comprehension on top of that to grab only the .csv files:

filelist = os.listdir()
csvs = [i for i in filelist if i.endswith('.csv')]

With a list of the files in your local directory, it's a matter of doing a for loop over your pd.read_csv step, processing each frame, and then writing it back out with df.to_csv.

Navigating between directories is a little more complicated, but you use the same os package. SO post for reference
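Putting those pieces together, the whole loop might look something like the sketch below. The folder names and the `reformat` callback are placeholders for your own paths and pandas logic (the callback would wrap your existing read_csv/transform/to_csv code before each file is moved):

```python
import os
import shutil

def process_folder(src_dir, done_dir, reformat):
    """Reformat every .csv in src_dir, then move it into done_dir."""
    os.makedirs(done_dir, exist_ok=True)  # create the nested output folder if needed
    csvs = [f for f in os.listdir(src_dir) if f.endswith('.csv')]
    for name in csvs:
        path = os.path.join(src_dir, name)
        reformat(path)  # e.g. your pandas read_csv(header=6) ... to_csv steps
        shutil.move(path, os.path.join(done_dir, name))
    return csvs
```

Because `done_dir` can be nested inside `src_dir`, already-processed files are skipped on the next pass: os.listdir only sees the .csv files still sitting in the top-level folder.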

willwrighteng
  • Thank you @will.cass.wrig ! I seem to be importing the directory list properly but when I compile that script with my reformatting script, it says `FileNotFoundError: [Errno 2] File Example_Export_File.csv does not exist: 'Example_Export_File.csv'` Any ideas as to why I am getting this error? It's clearly reading the directory properly and I know the file is in the directory. – Ted Corl Oct 14 '20 at 04:08
  • Nevermind, I got it. I used a different method. I used `for file in glob.glob("/my directory/*.csv"): pd.read_csv(open(file, 'rb'), header = 1) mydata =pd.read_csv(open(file, 'r'), header = 6)` – Ted Corl Oct 14 '20 at 04:38
  • I've never used `open` within `read_csv` since it has a built-in open function ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)). Glad you got things working! I don't miss my days in the lab but I sympathize. – willwrighteng Oct 14 '20 at 16:17