0

I'm having trouble parsing a txt file (see here: File) Here's my code

import pandas as pd
objectname = r"path"
df = pd.read_csv(objectname, engine = 'python',  sep='\t', header=None)

Unfortunately it does not work. Since this question has been asked several times, I tried lots of proposed solutions (most of them can be found here: Possible solutions)

However, nothing did the trick for me. For instance, when I use

sep='delimiter'

The dataframe is created but everything ends up in a single column.

When I use

error_bad_lines=False

The rows I'm interested in are simply skipped.

The only way it works is when I first open the txt file, copy the content, paste it into google sheets, save the file as CSV and then open the dataframe.

I guess another workaround would be to use

df = pd.read_csv(objectname, engine = 'python',  sep = 'delimiter', header=None)

in combination with the split function Split function

Is there any suggestion how to make this work without the need to convert the file or to use the split function? I'm using Python 3 and Windows 10. Any help is appreciated.

Alfred
  • 35
  • 4
  • Please provide an example of your file (few lines, not a link). – Ricardo Rendich Aug 24 '20 at 09:29
  • Pandas is complaining about your input file. It's saying that most rows have 2 values (2 columns, if you will), but that one line (line 32) has 4 values (4 columns). I'm not about to download some random file from the internet, but my guess is that line 32 is missing a newline that would separate it into two lines with 2 columns each. – Plasma Aug 24 '20 at 09:34

1 Answers1

0

Your file has tab separators but is not a TSV. The file is a mixture of metadata, followed by a "standard" TSV, followed by more metadata. Therefore, I found tackling the metadata as a separate task from loading the data to be useful.

Here's what I did to extract the metadata lines:

with open('example.txt','r') as file_handle:
    file_content = file_handle.read().split('\n')

for index, line in enumerate(file_content):
    if index<21 or index>37:
        print(index, line.split('\t'))

Note that the lines denoting the start and stop of metadata (21 and 37 in my example) are specific to the file. I've provided the trimmed data I used below (based on your linked file).

Separately, I loaded the TSV into Pandas using

import pandas as pd
df = pd.read_csv('example.txt', engine = 'python',  
                 sep='\t', error_bad_lines=False, header=None,
                 skiprows=list(range(21))+list(range(37,89)))

Again, I skipped the metadata at the start of the file and at the end of the file.

Here's the file I experimented with. I've trimmed the extra data to reduce line count.

TITLE   Test123
DATA TYPE   
ORIGIN  JASCO
OWNER   
DATE    19/03/28
TIME    16:39:44
SPECTROMETER/DATA SYSTEM    
LOCALE  1031
RESOLUTION  
DELTAX  -0,5
XUNITS  NANOMETERS
YUNITS  CD [mdeg]
Y2UNITS HT [V]
Y3UNITS ABSORBANCE
FIRSTX    300,0000
LASTX     190,0000
NPOINTS      221
FIRSTY      -0,78961
MAXY        37,26262
MINY       -53,38971
XYDATA
300,0000    -0,789606   182,198 -0,0205245
299,5000    -0,691644   182,461 -0,0181217
299,0000    -0,700976   182,801 -0,0136756
298,5000    -0,614708   182,799 -0,0131957
298,0000    -0,422611   182,783 -0,0130073
195,0000    26,6231 997,498 4,7258
194,5000    -17,3049    997,574 4,6864
194,0000    16,0387 997,765 4,63967
193,5000    -14,4049    997,967 4,58593
193,0000    -0,277261   998,025 4,52411
192,5000    -29,6098    998,047 4,45244
192,0000    -11,5786    998,097 4,36608
191,5000    34,0505 998,282 4,27376
191,0000    28,2325 998,314 4,1701
190,5000    -13,232 998,336 4,05036
190,0000    -47,023 998,419 3,91883

##### Extended Information
[Comments]
Sample name X
Comment
User
Division
Company RWTH Aachen

[Detailed Information]
Creation date   28.03.2019 16:39

Data array type Linear data array * 3
Horizontal axis Wavelength [nm]
Vertical axis(1)    CD [mdeg]
Vertical axis(2)    HT [V]
Vertical axis(3)    Abs
Start   300 nm
End 190 nm
Data interval   0,5 nm
Data points 221     

[Measurement Information]
Instrument name CD-Photometer
Model name  J-1100
Serial No.  A001361635

Detector    Standard PMT
  Lock-in amp.  X mode
  HT volt   Auto

Accessory   PTC-514
Accessory S/N   A000161648
  Temperature   18.63 C
  Control sonsor    Holder
  Monitor sensor    Holder

Measurement date    28.03.2019 16:39

Overload detect 203
Photometric mode    CD, HT, Abs
Measure range   300 - 190 nm
Data pitch  0.5 nm
CD scale    2000 mdeg/1.0 dOD
FL scale    200 mdeg/1.0 dOD
D.I.T.  0.5 sec
Bandwidth   1.00 nm
Start mode  Immediately
Scanning speed  200 nm/min
Baseline correction Baseline
Shutter control Auto
Accumulations   3
Ben
  • 563
  • 1
  • 5
  • 12
  • Thank you! Helped me a lot. How do I (and anyone else) figure out if there is metadata in the file? I'm actually only interested in the TSV data. So `skiprows=list(range(21))+list(range(37,89)))` did the trick. Now I just need to figure out how to automatically detect the beginning and end of the metadata, since I have hundreds of these files. – Alfred Aug 25 '20 at 12:36
  • Typically the metadata structure will be consistent (as might be expected if the files were all generated by the same program). You can figure out whether that's the case by manually inspecting a few example TSV files. For example, it might be the case that the upper metadata block always terminates with `XYDATA`. If that's true, then you'd merely need to look at which line number starts with the string `XYDATA` and collect data after that line. Similarly, I'll guess that lines after `##### Extended Information` are the lower metadata block. – Ben Aug 25 '20 at 19:47
  • Thank you, heres some code that extracts the TSV data automatically: `df = pd.read_csv(objectname, engine = 'python', sep = 'decimal', decimal=",")` `start = df.loc[df[df.columns[0]] == "XYDATA"].index[0]` `end = df.loc[df[df.columns[0]] == "##### Extended Information"].index[0]` `df1 = pd.read_csv(objectname, engine = 'python', sep = '\t', decimal=",", skiprows=list(range(start+2))+list(range(end+1,len(df)+9)), header = None)` – Alfred Aug 26 '20 at 06:56