np.genfromtxt importing wrong values from .csv file - 2nd column is full of commas inside quote marks

Question

EDIT: Problem solved, keeping this open for posterity.

numpy.genfromtxt has trouble with quoted strings that contain commas: it splits on every comma, including the ones inside the quote marks. To solve this, use pandas.read_csv, whose quotechar = '"' setting (the default) lets the importer treat each quoted string as a single field, commas included.
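A minimal sketch of that fix, assuming a file laid out roughly like the exports described in this question. The accessions and numbers in the sample are invented for illustration, and the "Description" header is an assumed name for the second column; only the quoted protein name comes from the comments.

```python
import io

import pandas as pd

# Hypothetical sample standing in for one exported .csv file.
# Accessions, numbers and the "Description" header are assumptions.
sample = io.StringIO(
    "Accession,Description,Score,# Peptides,MW [kDa]\n"
    "P00001,Protein MON2 homolog,12.5,3,190.1\n"
    'P00002,"Isoform PLEC-0,1C,2A,3A of Plectin",36.7,8,533.9\n'
)

indexes = ["Accession", "# Peptides", "MW [kDa]", "Score"]

# quotechar='"' is already the default; it is spelled out here for clarity.
# Selecting columns by name means the header order in the file does not matter.
df = pd.read_csv(sample, quotechar='"', usecols=indexes)[indexes]

# Mirror filling_values = 0.01, but only for the numeric columns.
numeric = indexes[1:]
df[numeric] = df[numeric].fillna(0.01)

# Same shape as the genfromtxt(...).tolist() output: one tuple per row.
rows = list(df.itertuples(index=False, name=None))
print(rows)
```

Passing a file path (or each path from glob.glob) in place of the StringIO object should work the same way.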

Strange problem here.

I'm importing lists of protein data from .csv files, and for 99.9% of IDs the import works flawlessly. However, 1 ID out of ~5,000 is consistently importing the wrong data.

Here's the code I use to pull in my data. It's using glob to pull in csv files with similar names. Headers are stored as a list and then used as columns just in case the csv files have their headers mixed around (damn you, Proteome Discoverer):

indexes = ["Accession", "# Peptides", "MW [kDa]", "Score"]
headers = pd.read_csv(str(WorkingDirectory) + "/" + str(name) + "-R1.csv", nrows=1).columns.tolist()
usecols = [headers.index(col) for col in indexes]
pattern = str(WorkingDirectory) + "/" + str(name) + "*.csv"
total = [np.genfromtxt(x, delimiter = ',', skip_header = 1, usecols = usecols, filling_values = 0.01, dtype = ('|U16','float64','float64','float64')).tolist() for x in glob.glob(pattern)]

IDs are then stored in a list, where each list entry matches the original file. [File 1, File 2, File 3]

Here's where it gets weird. Of the 5.5K entries in each .csv file, there is one ID that consistently (with code restarts) reports the wrong numbers.

Please find attached the output of my program, alongside the excel sheets that the data are sourced from. Columns A, C, E and H are my imports (Accession, Score, # Peptides and MW [kDa] respectively, in orange)

It looks like the ID's name and score are importing the correct value, but the next two columns are, respectively, off by 1 (it's importing F, not E) and then trying to pick up a value from an unspecified column that doesn't exist (hence the 0.01 due to filling values)

Things I have checked:

1) Yes, the excel headers are the same for all three files.

2) Yes, I have code in place to handle the downstream NaN nonsense that any zeros generate. So if it imports a 0 for the score, I manually change that later.

3) Yes, if there are missing values, the genfromtxt filling_values = 0.01 is going to fill that gap. In this case, however, it shouldn't need to fill any gaps, as there are corresponding values within the cells.

4) Every other ID I have checked is importing the data correctly.

5) Q60749 is not an unusual string. Others include: Q9CQM5, D3Z5X0, etc. No hashtags, no quote marks, no commas.

6) {From comments} All files contain only a single instance of this Protein ID

Why is this one ID causing problems out of the thousands of other successful hits? I originally found this hit because some downstream analysis said I had a NaN value; Q60749 turned out to be that value, and it's simply not importing the correct data.

Is it possible that ID Q60749 appears more than once in each file? — Ofer Sadan, Jun 06 '18 at 05:46
I think you are showing the results and the excel, but what about the corresponding line(s) in the actual csv? Looking at it with some sort of text printer or editor. That's what `genfromtxt` is working from. — hpaulj, Jun 06 '18 at 06:08
TY hpaulj. I think I've found a problem: There are actually many entries that aren't importing correctly. What they all have in common is that the second column has quote marks — PeptideWitch, Jun 06 '18 at 06:08
Q9WU01,"etc etc etc",36.7,.... -- doesn't import correctly. E9QPE7,Name,6661.88,... -- imports correctly. — PeptideWitch, Jun 06 '18 at 06:09
The 2nd column is full of commas. Of course it is. It's supposed to take that whole column as a name, and some come pre-built with quote marks and others do not. EG: Protein MON2 homolog OS=Mus musculus GN=Mon2 PE=1 SV=2 - [MON2_MOUSE] Fine. "Isoform PLEC-0,1C,2A,3A of Plectin OS=Mus musculus GN=Plec - [PLEC_MOUSE]" Not fine. — PeptideWitch, Jun 06 '18 at 06:20
It seems that the database I'm using was kind enough to pre-empt comma usage. So if the name has a comma inside of it, the csv file will wrap the name with quotation marks " ". I wonder how I can get `genfromtxt` to recognise stuff within quote marks as a full name and to not look inside and find commas {hint put this as an answer and I'll mark it as correct} — PeptideWitch, Jun 06 '18 at 06:24
Problem solved. I changed the numpy `genfromtxt` to pandas `read_csv` in order to use `quotechar = '"'`. As mentioned here (https://stackoverflow.com/questions/17933282/using-numpy-genfromtxt-to-read-a-csv-file-with-strings-containing-commas/17942117#17942117) it seems that numpy can't handle this problem. TLDR use pandas. — PeptideWitch, Jun 06 '18 at 09:23
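
To see why the quoted names shifted the columns, here is a small sketch using only the standard library. It compares a plain split on commas (which is effectively what a delimiter-only parser does) with Python's csv module, which honours quote marks. The accession and numbers in the sample line are hypothetical; the protein name is the one quoted in the comments above.

```python
import csv

# Hypothetical row; only the quoted name is taken from the comments.
line = 'P00002,"Isoform PLEC-0,1C,2A,3A of Plectin",36.7,8,533.9'

# Splitting on every comma breaks the quoted name into four pieces,
# so every later value lands three columns to the right.
naive = line.split(",")

# csv.reader treats the quoted text as one field.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```

With three commas inside the name, the naive split yields three extra fields, which would explain both the off-by-some column values and the filled-in 0.01 for rows like this one.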
