1

I have a tab-delimited set of data with some columns not in the correct order, and missing.

Some rows have 12 columns, others have 13 or 14 or 15 columns.

When I look at the data, I find the following types of rows

.... Col_10: 25    Col_11: 23432    Col12: 639142
.... Col_10: 25    Col_12: 25134    Col13: 243344
.... Col_11: 75    Col_13: 79876    Col15: 634534    Col12: 5    Col14: 73453
.... Col_10: 25    Col_11: 32425    Col13: 989423
.... Col_10: 25    Col_11: 23424    Col12: 342421    Col13: 7    Col14: 13424    Col 15: 67
.... Col_10: 95    Col_11: 32121    Col15: 111231

So, each column from Col_10 to Col_15 is labeled such that column_name: value.

I would like to read this data into a pandas dataframe using

import pandas as pd
df = pd.read_table("fname.dat")

where fname.dat is the format for the data above.

How do you do this? It appears I need to allocate space for 15 columns, and then parse the data such that if it begins with Col_10, this value should be put in the 10th column, if Col_11 then in the 11th, etc.

How does one read in such data with pandas?

ShanZhengYang
  • 16,511
  • 49
  • 132
  • 234
  • 1
    Add rows from fname.dat file to the question. How you read the file will solve this. – Merlin Jul 20 '16 at 22:55
  • 1
    Also, trying reading as tab-delimited to see if tabs are there and not being read properly. – Merlin Jul 20 '16 at 23:02
  • @Merlin Above are the rows. Each row value is exactly shown; they've labeled each value with the column, e.g. "Col_11: value" is a value – ShanZhengYang Jul 20 '16 at 23:16
  • @ShanZhengYang, if the data set you've posted in your question is exactly in the same format as your real data, then it's not a TSV (tab separated values) - it's "broken" [fixed-width formatted](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html) file. I've put "broken" - because it's not a standard CSV/TSV/fixed file, as it has various number of columns – MaxU - stand with Ukraine Jul 21 '16 at 08:55
  • @MaxU How does `pandas.read_fwf()` deal with these varying row length with columns in different order? Does this need to be manually parsed? If so, how? – ShanZhengYang Jul 21 '16 at 12:43
  • I'd try to read this text file into a dict first and then create a data frame. BTW, there are two `Col_13` in the third row. Do the original rows contain duplicates? If so, how to interpret it? – ptrj Jul 21 '16 at 18:01
  • @ptrj Edited---mistake. The file is several GB in size---how would you read into a dictionary? – ShanZhengYang Jul 21 '16 at 18:31
  • Ah, I see. It doesn't have to be a dict. It may be a pre-allocated numpy array - creating a data frame from it wouldn't copy the array. Reading text data from a file is fast, so I guess the main problem is string parsing (it may be slow in a loop). Let me think about it... – ptrj Jul 21 '16 at 18:59
  • @ptrj There should be some way to pre-allocated "empty" columns into a 15-16 column dataframe, and then parse/place each of these values into a dataframe. – ShanZhengYang Jul 21 '16 at 19:28
  • Allocating is easy: either 1) `np.zeros` or 2) `np.empty` and assign NaN's to it or 3) `pd.DataFrame` with specified cols and index but no data. Tricky is efficient data parsing. Another idea: if memory is an issue, you may read data in chunks (dealing with var number of cols as [here](http://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python)), parse, and assign to a pre-allocated array/frame. Also: depending on your data you may use int32/float32 instead of int64/float64 to save space. – ptrj Jul 21 '16 at 19:44
  • @MaxU It doesn't seem to work with `pandas.read_fwf()`. How about this: could we allocate "empty" columns and then parse the values (`column_name:value`) of these columns in order to title them? The following rows could then be individually sorted. Otherwise, I'm lost. – ShanZhengYang Jul 25 '16 at 17:49
  • @ShanZhengYang, could you post or upload somewhere 10-30 rows with the different number of columns either from the original file or in __exactly the same format__ (but with anonymized data, if you have a sensible data) – MaxU - stand with Ukraine Jul 25 '16 at 17:59
  • @MaxU Ok, but I might get busted for a repost. – ShanZhengYang Jul 25 '16 at 18:09
  • you can mention that it wasn't answered and post a link to this question – MaxU - stand with Ukraine Jul 25 '16 at 18:12
  • @MaxU New question here: http://stackoverflow.com/questions/38575481/how-to-parse-labeled-values-of-columns-into-a-pandas-dataframe-some-column-valu – ShanZhengYang Jul 25 '16 at 18:56

0 Answers0