3

I am new to Python and I would like to use pandas for reading the data. I have done some searching and effort to solve my issue but still I am struggling. thanks for your help in advance!

I have a.txt file looking like this;

skip1
 A1| A2 |A3 |A4# A5# A6 A7| A8 , A9
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9

END***
Some other data starts from here

The first task is that I would like to assign A1,A2,A3,A4,A5,A6,A7,A8 and A9 as column names. However, there are multiple separators such as ' ','|','#' and this makes hassle to assign separator when reading the file. I tried like this;

import pandas as pd
import glob
filelist=glob.glob('*.txt')
print(filelist)

df = pd.read_csv(filelist,skiprows=1,skipfooter=2,skipinitialspace=True, header=0, sep=r'\| |,|#',engine='python') 

But it seems that nothing is happened when I check Spyder's data explorer df.

The second task is that during the reading removing the data starting with the rows END*** that I don't need. The header has always the same length. However, skipfooter needs the number of lines to skip, which should be changed between the files.

Some several questions already been asked but It seems I can't make them work on my question!

how-to-read-txt-file-in-pandas-with-multiple-delimiters

pandas-read-delimited-file?

import-text-to-pandas-with-multiple-delimiters

pandas-ignore-all-lines-following-a-specific-string-when-reading-a-file-into-a

EDIT: about removing the the reading removing the data starting with the rows END

If the b.txt file like this b.txt

skip1
 A1| A2 |A3 |A4# A5# A6 A7| A8 , A9
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9

END123
Some other data starts from here

an by using the second solution below;

txt = open('b.txt').read().split('\nEND')[0]
_, h, txt = txt.split('\n', 2)
pat = r'[\|, ,#,\,]+'
names = re.split(pat, h.strip())

pd.read_csv(
    pd.io.common.StringIO(txt),
    names=names, header=None,
    engine='python')

Getting this,

   A1  A2  A3  A4  A5  A6  A7  A8  A9
0   1   2   3   4   5   6   7   8   9
1   1   2   3   4   5   6   7   8   9
2   1   2   3   4   5   6   7   8   9
Alexander
  • 4,527
  • 5
  • 51
  • 98
  • @piRSquared it exactly same as in the example.Just save it as a.txt and read it! That is how I did. Sorry the real file is too big. a.txt is just a representative. – Alexander Aug 15 '17 at 14:39

2 Answers2

4

answering first question:

In [182]: df = pd.read_csv(filename, sep=r'\s*(?:\||\#|\,)\s*', 
                           skiprows=1, engine='python')

In [183]: df
Out[183]:
   A1  A2  A3  A4  A5  A6 A7  A8  A9
1   2   3   4   5   6      7   8   9
1   2   3   4   5   6      7   8   9
1   2   3   4   5   6      7   8   9
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
4

Split the file, then read from string

txt = open('test.txt').read().split('\nEND***')[0]
pd.read_csv(
    pd.io.common.StringIO(txt),
    sep=r'\W+',
    skiprows=1, engine='python')

   A1  A2  A3  A4  A5  A6  A7  A8  A9
0   1   2   3   4   5   6   7   8   9
1   1   2   3   4   5   6   7   8   9
2   1   2   3   4   5   6   7   8   9

We can be very explicit with the parsing of the header and parse the rest of the file as csv

txt = open('test.txt').read().split('\nEND***')[0]
_, h, txt = txt.split('\n', 2)
pat = r'[\|, ,#,\,]+'
names = re.split(pat, h.strip())

pd.read_csv(
    pd.io.common.StringIO(txt),
    names=names, header=None,
    engine='python')

   A1  A2  A3  A4  A5  A6  A7  A8  A9
0   1   2   3   4   5   6   7   8   9
1   1   2   3   4   5   6   7   8   9
2   1   2   3   4   5   6   7   8   9
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thanks for the great answer but I just found some issues. it seems that A6 and A7 columns are combined. Although they are separated with whitespace and I would prefer to have skip rows starting with `END` because the strings after the `END` changes in other file. I would really appreciate if you could you update your solution based on these issues. Thanks! – Alexander Aug 15 '17 at 15:03
  • @Alexander I changed the `sep`. It works for this example and should work so long as your column names are contiguous alphanumeric strings. Your file is of undetermined length. I won't know how many lines to skip until the end. According to your file, splitting at `'END'` should work. – piRSquared Aug 15 '17 at 15:55
  • thanks for the updating the `sep` part. Yes my file is of undetermined length. If I change the `END***` to `END**` it does not delete this row and the rest. I guess it is same if there is only `END`. My point is that if there is a rows starting with 'END' remove it and everything after it! – Alexander Aug 15 '17 at 16:10
  • @Alexander I'm failing to see where this strategy breaks down. If you have an example where it doesn't work, share it in your post so that I/we can see it. Then I can address the element that is getting in the way. – piRSquared Aug 15 '17 at 16:20
  • Sorry for misguiding! I edited the OP. Please check the EDIT part. I'll remove it after your comment! – Alexander Aug 15 '17 at 16:33
  • Ahh, I see. The `'END***'` is now `'END123'` Just split on `'END'` – piRSquared Aug 15 '17 at 16:41
  • Spot ON!. Sorry I didn't catch your previous point! Thanks man for your guidance! I really appreciated it! – Alexander Aug 15 '17 at 16:47
  • No problem, glad we could help. – piRSquared Aug 15 '17 at 16:49
  • I just sent new question about new issue that I encountered and linked to this post! [multiple-delimiters-in-column-headers-also-separates-the-row-values](https://stackoverflow.com/questions/45699077/multiple-delimiters-in-column-headers-also-separates-the-row-values) – Alexander Aug 15 '17 at 18:28