I have a .txt file which looks like:

# Explanatory text
# Explanatory text
# ID_1 ID_2
10310   34426
104510  4582343
1032410 5424233
12410   957422

In the file, the two IDs on the same row are separated by a tab character ('\t').

I'm trying to do some analysis using the numbers in the dataset, so I want to delete the first three rows. How can this be done in Python? I.e., I'd like to produce a new dataset that looks like:

10310   34426
104510  4582343
1032410 5424233
12410   957422

I've tried the following code but it didn't work:

f = open(filename,'r')
lines = f.readlines()[3:]
f.close()

It doesn't work because I get this format (a list, with \t and \n present), not the one I indicated I want above:

['10310\t34426\n', '104510\t4582343\n', '1032410\t5424233\n' ... ]

  • You can simply ignore the line if it starts with a `#` – rdas Oct 18 '20 at 16:20
  • You might want to say a little more than "it didn't work". –  Oct 18 '20 at 16:20
  • `readlines` is zero based. Use `lines = f.readlines()[3:]` – Mike67 Oct 18 '20 at 16:21
  • What didn't work? What output did you get? What did you expect? – Joooeey Oct 18 '20 at 16:22
  • If you are using `pandas` for your EDA, there is a `skiprows` parameter in `pandas.read_csv`: `pandas.read_csv(filepath_or_buffer, delimiter='\t', skiprows=2)` – Shijith Oct 18 '20 at 16:23
  • skiprows = 2--> will ignore only 2nd line – aman nagariya Oct 18 '20 at 16:27
  • @Mike67 I've edited the question to include the output to show you why this doesn't work in the way I'd like. Can you tell why? –  Oct 18 '20 at 16:57
  • Your sample output looks like debugger output. At the end of your code, add `print(lines[:10])` to see the first 10 lines in the console. The `\t` and `\n` should be correctly displayed. – Mike67 Oct 18 '20 at 17:10
  • Does this answer your question? [Parsing a tab-delimited .txt into a Pandas DataFrame](https://stackoverflow.com/questions/60571932/parsing-a-tab-delimited-txt-into-a-pandas-dataframe) – Tomerikoo Oct 20 '20 at 14:16

3 Answers

You can try something like this:

with open(filename, 'r') as fh:
    for curline in fh:
        # check if the current line
        # starts with "#"
        if curline.startswith("#"):
            ...
            ...
        else:
            ...
            ...
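
As a minimal sketch of how the two branches above might be filled in, assuming the goal is to copy every non-comment line into a new file: the function name `strip_comment_lines`, its parameters, and the file names `input.txt` and `output.txt` are hypothetical and only there for illustration.

```python
import os
import tempfile


def strip_comment_lines(src_path, dst_path, marker="#"):
    """Copy src_path to dst_path, leaving out lines that start with marker.

    Returns the number of lines that were kept.
    """
    kept = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            if line.startswith(marker):
                continue  # comment/header line: skip it
            dst.write(line)  # data line: keep it, newline included
            kept += 1
    return kept


# Small demonstration on a throwaway copy of the sample data.
sample = (
    "# Explanatory text\n"
    "# Explanatory text\n"
    "# ID_1 ID_2\n"
    "10310\t34426\n"
    "104510\t4582343\n"
)
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "input.txt")
    dst_file = os.path.join(tmp, "output.txt")
    with open(src_file, "w") as fh:
        fh.write(sample)
    kept = strip_comment_lines(src_file, dst_file)
    with open(dst_file) as fh:
        result = fh.read()

print(kept)
print(result, end="")
```

Because this filters on the `#` prefix rather than on a fixed line count, it should keep working if the number of header lines changes, but it would also drop any data line that happens to start with `#`.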
Mario Rojas

You can use Python's pandas library to do this kind of task easily:

import pandas as pd

pd.read_csv(filename, header=None, skiprows=[0, 1, 2], sep='\t')
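
If pandas is not installed, roughly the same result can be sketched with the standard library's `csv` module. This swaps in a different technique from the `read_csv` call above, and it assumes every data row holds two integer IDs; `read_id_pairs`, its `skip` parameter, and the file name `ids.txt` are hypothetical names used only for illustration.

```python
import csv
import os
import tempfile


def read_id_pairs(path, skip=3):
    """Return the data rows of a tab-separated file as (int, int) tuples.

    The first `skip` rows are treated as header text and ignored.
    """
    pairs = []
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        for row_number, row in enumerate(reader):
            if row_number < skip or not row:
                continue  # header row or blank line
            pairs.append((int(row[0]), int(row[1])))
    return pairs


# Small demonstration on a throwaway copy of the sample data.
sample = (
    "# Explanatory text\n"
    "# Explanatory text\n"
    "# ID_1 ID_2\n"
    "10310\t34426\n"
    "104510\t4582343\n"
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "ids.txt")
    with open(sample_path, "w") as fh:
        fh.write(sample)
    pairs = read_id_pairs(sample_path)

print(pairs)
```

Converting to integers at read time means the `\t` and `\n` characters never reach the analysis code, which seems to be what the question is after.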
Tomerikoo

Ok, here is the solution:

with open('file.txt') as f:
    lines = f.readlines()

lines = lines[3:]

Remove Comments

This function removes all comment lines:

def remove_comments(lines):
    return [line for line in lines if not line.startswith("#")]

Remove the top n lines

def remove_n_lines_from_top(lines, n):
    if n <= len(lines):
        return lines[n:]
    else:
        return lines

Here is the complete source:

with open('file.txt') as f:
    lines = f.readlines()


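def remove_comments(lines):
    # Keep only the lines that do not start with "#".
    return [line for line in lines if not line.startswith("#")]

def remove_n_lines_from_top(lines, n):
    # Drop the first n lines; if n exceeds the line count, return the list unchanged.
    return lines[n if n <= len(lines) else 0:]

lines = remove_n_lines_from_top(lines, 3)

# Save the remaining lines to a new file.
with open("new_file.txt", "w") as f:
    f.writelines(lines)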
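
For large files, one possible variation is to skip the top lines with `itertools.islice` instead of loading everything with `readlines()`. This is only a sketch under that assumption; `copy_without_top_lines` and the file names `file.txt` and `new_file.txt` inside the temporary directory are hypothetical.

```python
import os
import tempfile
from itertools import islice


def copy_without_top_lines(src_path, dst_path, n):
    """Write src_path to dst_path, dropping the first n lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        # islice(src, n, None) yields every line from index n onward,
        # reading the file lazily instead of building a full list.
        dst.writelines(islice(src, n, None))


# Small demonstration on a throwaway copy of the sample data.
sample = (
    "# Explanatory text\n"
    "# Explanatory text\n"
    "# ID_1 ID_2\n"
    "10310\t34426\n"
    "104510\t4582343\n"
)
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "file.txt")
    dst_file = os.path.join(tmp, "new_file.txt")
    with open(src_file, "w") as fh:
        fh.write(sample)
    copy_without_top_lines(src_file, dst_file, 3)
    with open(dst_file) as fh:
        written = fh.read()

print(written, end="")
```

Note that the `\t` and `\n` seen when printing a list of lines are only how Python displays the strings; once the lines are written to a file or printed one by one, they appear as ordinary tabs and line breaks.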
Peyman Majidi