
I had an issue with defining multiple separators when reading a file. It was originally solved in my previous post, reading-files-with-multiple-delimiter-in-column-headers-and-skipping-some-rows, thanks to @piRSquared.

When I looked at my real data in detail, I realized that some columns contain values with .cd or .dvd extensions. When I applied the solution above, these values were also split into new columns, and the solution stopped working.

b.txt

skip1
 A1| A2 |A3 |A4# A5# A6 A7| A8 , A9
1,2,3,4,5.cd,6,7,8.dvd,9
1,2,3,4,5.cd,6,7,8.dvd,9
1,2,3,4,5.cd,6,7,8.dvd,9

END123
Some other data starts from here

and read this b.txt file using the solution above

import io
import pandas as pd

txt = open('b.txt').read().split('\nEND')[0]
pd.read_csv(
    io.StringIO(txt),
    sep=r'\W+',
    skiprows=1, index_col=False, engine='python')

   A1  A2  A3  A4  A5  A6  A7  A8  A9
0   1   2   3   4   5  cd   6   7   8
1   1   2   3   4   5  cd   6   7   8
2   1   2   3   4   5  cd   6   7   8

The A5 column should contain the values

5.cd
5.cd
5.cd

and the A8 column should contain

8.dvd
8.dvd
8.dvd

We should also still have an A9 column, but it seems to disappear because of this conflict.
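The cause is easy to see if you apply the same pattern to a single data row by hand: `\W+` matches any run of non-word characters, so the dot in `5.cd` is treated as a separator, exactly like the commas. A minimal sketch using one row from b.txt:

```python
import re

# '.' is a non-word character, so '5.cd' and '8.dvd' are split apart
print(re.split(r'\W+', '1,2,3,4,5.cd,6,7,8.dvd,9'))
# ['1', '2', '3', '4', '5', 'cd', '6', '7', '8', 'dvd', '9']
```

That produces 11 fields for 9 column names, which is why the columns end up shifted.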

EDIT:

I constructed a sample that closely matches my real data:

 skip rows
 A1| A2| A3|A4# A5#  |  A6 | A7  , A8,  A9  | A10 |
 1 | 2 | 3 |4 # 5 #  | 6.cd|7.dvd,   ,      | 10  | 
 1 | 2 | 3 |4 # 5 #  | 6.cd|     ,   ,   9  | 10  |
 1 | 2 | 3 |4 # 5 #  |     |7.dvd,   ,      | 10  |

END123
Some other data starts from here

and tried

import io
import re
import pandas as pd

txt = open('real_dat.txt').read().split('\nEND')[0]
_, h, txt = txt.split('\n', 2)
pat = r'[\|, ,#,\,]+'
names = re.split(pat, h.strip())

df = pd.read_csv(
    io.StringIO(txt),
    names=names, skiprows=1, index_col=False,
    engine='python')

and got incorrect output (screenshot not reproduced here).

Alexander

1 Answer


Updated Answer
It was just easier to get rid of the spaces first. Let me know if this works:

import io
import pandas as pd

txt = open('b.txt').read().split('\nEND')[0] \
    .replace(' ', '').replace('|\n', '\n').split('\n', 1)[1]

pd.read_csv(
    io.StringIO(txt),
    sep=r'#\||\||#|,',
    engine='python')

   A1  A2  A3  A4  A5    A6     A7  A8   A9  A10
0   1   2   3   4   5  6.cd  7.dvd NaN  NaN   10
1   1   2   3   4   5  6.cd    NaN NaN  9.0   10
2   1   2   3   4   5   NaN  7.dvd NaN  NaN   10
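You can check why this separator keeps the dotted values intact by splitting one cleaned-up data row by hand. This is my own illustration (not part of the original answer); the string below is the second data line of the sample after the spaces and trailing `|` are stripped:

```python
import re

# only literal delimiters ('#|', '|', '#', ',') are matched,
# so '6.cd' and '7.dvd' survive, and each ',' counts separately
row = '1|2|3|4#5#|6.cd|7.dvd,,|10'
print(re.split(r'#\||\||#|,', row))
# ['1', '2', '3', '4', '5', '6.cd', '7.dvd', '', '', '10']
```

The two empty strings become the NaN values in A8 and A9 above, because each delimiter is matched exactly once rather than as a run.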

Old Answer

I used \W+ as a fast and easy way to parse what you showed. Below I used something more specific to the actual delimiters you need.

import io
import pandas as pd

txt = open('b.txt').read().split('\nEND')[0]
pd.read_csv(
    io.StringIO(txt),
    sep=r'[\|, ,#,\,]+',
    skiprows=1, index_col=False, engine='python')

   A1  A2  A3  A4    A5  A6  A7     A8  A9
0   1   2   3   4  5.cd   6   7  8.dvd   9
1   1   2   3   4  5.cd   6   7  8.dvd   9
2   1   2   3   4  5.cd   6   7  8.dvd   9

However, I still think this is a cleaner way to do it. Here, I separate the parsing of the header from the parsing of the rest of the data. That way, I can assume the data itself uses only , as a separator.

import io
import re
import pandas as pd

txt = open('b.txt').read().split('END')[0]
_, h, txt = txt.split('\n', 2)
pat = r'[\|, ,#,\,]+'
names = re.split(pat, h.strip())

pd.read_csv(
    io.StringIO(txt),
    names=names, header=None,
    engine='python')

   A1  A2  A3  A4    A5  A6  A7     A8  A9
0   1   2   3   4  5.cd   6   7  8.dvd   9
1   1   2   3   4  5.cd   6   7  8.dvd   9
2   1   2   3   4  5.cd   6   7  8.dvd   9
piRSquared
  • I am sorry for bothering you again, but I have been trying your solution for more than an hour and found that only the first solution now works with my real data :) and I don't know why the latter one doesn't! – Alexander Aug 15 '17 at 18:51
  • I'm assuming the data is all comma separated. Maybe it's not. Also, if you can reduce the problem down to just the data that doesn't work, we can fix it. And don't worry about "bothering" me. I have a choice to help or not help. If I choose to help, then you're not bothering me (-: – piRSquared Aug 15 '17 at 18:54
  • Thank you for choosing to `help!`. You solved one of my biggest issues, you can't imagine :). BTW, if the file is tab separated, what would the code be? – Alexander Aug 15 '17 at 18:58
  • use `sep=\t` instead – piRSquared Aug 15 '17 at 19:00
  • Do you mean something like this: `sep=r'\t[\|, ,#,\,]+'`? I just tried it and it doesn't seem to work. – Alexander Aug 15 '17 at 19:05
  • No, you need to put the `\t` inside the square brackets. This is regex syntax and transcends python. regex is extremely powerful and can be very confusing. `sep=r'[\t,\|, ,#,\,]+'` – piRSquared Aug 15 '17 at 19:26
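To illustrate the suggestion above, here is a sketch of the tab-aware pattern. The character class below is equivalent to the one in the comment, with the redundant commas removed (inside `[...]` each character stands alone, so the extra commas only re-list `,`):

```python
import re

# tabs, bars, spaces, hashes, and commas all act as separators,
# and '+' collapses consecutive delimiters into one split point
pat = r'[\t\| #,]+'
print(re.split(pat, 'A1|\tA2 #\tA3,A4'))
# ['A1', 'A2', 'A3', 'A4']
```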
  • Ok, I see. Arggh! I just found that if there is an empty value in one of the columns, the other columns shift to fill it! I can't believe it. This data is killing me! Could you check the OP? I added the EDIT part! – Alexander Aug 15 '17 at 19:35
  • This is a direct result of trying to find one magic regex that parses your data and header. Our separator at the moment takes any comma, bar, hash, or space either one or several at a time. I allowed for several in order to capture the separation of `'A7| A8'` and `'A5# A6'`. But this has gotten in the way and is interpreting `,,` as a single separator instead of 2. I recommend that you attempt solution 2 again and when it doesn't work, post the smallest amount of sample data that recreates the problem. Until then, I'm attempting to find the magic regex that parses both header and data. – piRSquared Aug 15 '17 at 19:45
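The `,,`-as-one-separator behaviour described above is easy to reproduce in isolation. A minimal sketch (my own illustration; the class below is equivalent to the answer's `[\|, ,#,\,]+`):

```python
import re

# the '+' makes the whole run ',,|' a single separator,
# so the empty fields between the commas are silently lost
print(re.split(r'[\|, #]+', '7.dvd,,|10'))
# ['7.dvd', '10']
```

With the empty fields gone, the remaining values shift left to fill the missing columns.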
  • When I attempted the second solution, it worked on b.txt. OTOH, when I used the same code on my real data, I got everything in one single column with no column name. – Alexander Aug 15 '17 at 19:59
  • Ok, so create the smallest file you can in which that happens. Then share that in your post. – piRSquared Aug 15 '17 at 20:02
  • Sorry, I was traveling. I updated the OP. Could you check the EDIT part? Many thanks! – Alexander Aug 15 '17 at 20:56
  • Ok, so you really do need the magic bullet regex. I'll work on it. – piRSquared Aug 15 '17 at 21:21
  • Thanks! I really appreciate it! I believe your solution will be a reference for future issues like this ;) – Alexander Aug 15 '17 at 21:42
  • worked great. I really really appreciate your help! best regards! – Alexander Aug 15 '17 at 22:19
  • @Alexander also... whoever created this file deserves coal for christmas. – piRSquared Aug 15 '17 at 22:43
  • I said almost the same thing to the person who gave me this data :)) I've never worked with such a badly constructed data format, but your willingness to help is very much appreciated! – Alexander Aug 15 '17 at 23:10