How to import unlabeled and missing columns into a pandas dataframe?

Question

I have a tab-delimited set of data with some columns not in the correct order, and missing.

Some rows have 12 columns, others have 13 or 14 or 15 columns.

When I look at the data, I find the following types of rows

.... Col_10: 25    Col_11: 23432    Col12: 639142
.... Col_10: 25    Col_12: 25134    Col13: 243344
.... Col_11: 75    Col_13: 79876    Col15: 634534    Col12: 5    Col14: 73453
.... Col_10: 25    Col_11: 32425    Col13: 989423
.... Col_10: 25    Col_11: 23424    Col12: 342421    Col13: 7    Col14: 13424    Col 15: 67
.... Col_10: 95    Col_11: 32121    Col15: 111231

So, each column from Col_10 to Col_15 is labeled such that column_name: value.

I would like to read this data into a pandas dataframe using

import pandas as pd
df = pd.read_table("fname.dat")

where fname.dat is the format for the data above.

How do you do this? It appears I need to allocate space for 15 columns, and then parse the data such that if it begins with Col_10, this value should be put in the 10th column, if Col_11 then in the 11th, etc.

How does one read in such data with pandas?

Add rows from fname.dat file to the question. How you read the file will solve this. — Merlin, Jul 20 '16 at 22:55
Also, trying reading as tab-delimited to see if tabs are there and not being read properly. — Merlin, Jul 20 '16 at 23:02
@Merlin Above are the rows. Each row value is exactly shown; they've labeled each value with the column, e.g. "Col_11: value" is a value — ShanZhengYang, Jul 20 '16 at 23:16
@ShanZhengYang, if the data set you've posted in your question is exactly in the same format as your real data, then it's not a TSV (tab separated values) - it's "broken" [fixed-width formatted](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html) file. I've put "broken" - because it's not a standard CSV/TSV/fixed file, as it has various number of columns — MaxU - stand with Ukraine, Jul 21 '16 at 08:55
@MaxU How does `pandas.read_fwf()` deal with these varying row length with columns in different order? Does this need to be manually parsed? If so, how? — ShanZhengYang, Jul 21 '16 at 12:43
I'd try to read this text file into a dict first and then create a data frame. BTW, there are two `Col_13` in the third row. Do the original rows contain duplicates? If so, how to interpret it? — ptrj, Jul 21 '16 at 18:01
@ptrj Edited---mistake. The file is several GB in size---how would you read into a dictionary? — ShanZhengYang, Jul 21 '16 at 18:31
Ah, I see. It doesn't have to be a dict. It may be a pre-allocated numpy array - creating a data frame from it wouldn't copy the array. Reading text data from a file is fast, so I guess the main problem is string parsing (it may be slow in a loop). Let me think about it... — ptrj, Jul 21 '16 at 18:59
@ptrj There should be some way to pre-allocated "empty" columns into a 15-16 column dataframe, and then parse/place each of these values into a dataframe. — ShanZhengYang, Jul 21 '16 at 19:28
Allocating is easy: either 1) `np.zeros` or 2) `np.empty` and assign NaN's to it or 3) `pd.DataFrame` with specified cols and index but no data. Tricky is efficient data parsing. Another idea: if memory is an issue, you may read data in chunks (dealing with var number of cols as [here](http://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python)), parse, and assign to a pre-allocated array/frame. Also: depending on your data you may use int32/float32 instead of int64/float64 to save space. — ptrj, Jul 21 '16 at 19:44
@MaxU It doesn't seem to work with `pandas.read_fwf()`. How about this: could we allocate "empty" columns and then parse the values (`column_name:value`) of these columns in order to title them? The following rows could then be individually sorted. Otherwise, I'm lost. — ShanZhengYang, Jul 25 '16 at 17:49
@ShanZhengYang, could you post or upload somewhere 10-30 rows with the different number of columns either from the original file or in __exactly the same format__ (but with anonymized data, if you have a sensible data) — MaxU - stand with Ukraine, Jul 25 '16 at 17:59
you can mention that it wasn't answered and post a link to this question — MaxU - stand with Ukraine, Jul 25 '16 at 18:12
@MaxU New question here: http://stackoverflow.com/questions/38575481/how-to-parse-labeled-values-of-columns-into-a-pandas-dataframe-some-column-valu — ShanZhengYang, Jul 25 '16 at 18:56

How to import unlabeled and missing columns into a pandas dataframe?

0 Answers0

Linked