
In pandas read_csv, is there a way to specify e.g. col1, col15, wholeline?

I am trying to import about 700,000 rows of data from a text file that uses carets ('^') as delimiters, has no text qualifiers, and uses a carriage return as the line delimiter.

From the text file I need column 1, column 15, and then the whole line, as three columns of a table/dataframe.

I've searched for how to do this in pandas, but I don't know it well enough to work out the logic. I can import all 26 columns fine, but that doesn't solve my problem.

import pandas as pd

my_df = pd.read_csv("tablefile.txt", sep="^", lineterminator="\r", low_memory=False)

Alternatively, I can use standard Python to put the data into a table, but this takes about 4 hours for the 700,000 rows, which is far too long for me.

import re
import sqlite3

# cur and conn are assumed to be an existing sqlite3 cursor/connection
# created elsewhere, e.g. conn = sqlite3.connect(...); cur = conn.cursor()

count_1 = 0
for line in open('tablefile.txt'):
    if count_1 > 70:
        break
    else:
        # column 1: the digits before the first '^'
        col1id = re.findall(r'^(\d+)\^', line)
        # column 15: digits located by stepping over the '^'-delimited fields
        col15id = re.findall(r'^.*\^.*\^(\d+)\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*', line)
        line = line.strip()

        count_1 = count_1 + 1

        cur.execute('''INSERT INTO mytable (mycol1id, mycol15id, wholeline) VALUES (?, ?, ?)''',
                    (col1id[0], col15id[0], line))

        conn.commit()
    print('row count_1=', count_1)

In pandas read_csv, is there a way to specify e.g. col1, col15, wholeline?

As above: col1 and col15 are digits, and wholeline is a string.

  • I do not want to rebuild the string after import, as I might lose some characters in the process.

Thanks

EDIT: Committing to the database for each line was burning time.

CArnold
  • When using only Python, you should compile regular expressions once, outside your loop. That should speed things up – Guillaume Jacquenot Feb 09 '17 at 11:03
  • I don't understand how this would work. I thought re.findall(regex, object) required object to be created before calling re.findall. Do you have an example? – CArnold Feb 09 '17 at 14:24

3 Answers


Read the full lines in as one dataframe using some quasi-separator that does not occur in the data (below I'm using &), then read the file again with usecols to pick out columns 1 and 15, and add them to the first dataframe.

# first pass: read each full line into a single column, using a
# quasi-separator that should never occur in the data ('&' here)
my_df_full = pd.read_csv("tablefile.txt", sep="&", lineterminator="\r", low_memory=False)
my_df_full.columns = ['full_line']

# second pass: split on '^' and keep only the two wanted columns
# (note that usecols takes 0-based positions)
my_df_cols = pd.read_csv("tablefile.txt", sep="^", lineterminator="\r", low_memory=False, usecols=[1, 15])

# assign by position; .values sidesteps alignment on the differing column names
my_df_full[['col1', 'col15']] = my_df_cols.values
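
If no safe quasi-separator turns up, a single-pass variation (a sketch only, untested on the OP's data) is to read the raw lines without any CSV parsing and derive the two columns with pandas string methods, so full_line is never rebuilt. The indices 0 and 14 assume "column 1" and "column 15" mean the first and fifteenth '^'-delimited fields:

import pandas as pd

# newline='' stops Python translating the bare '\r' terminators to '\n'
with open('tablefile.txt', newline='') as f:
    lines = [ln for ln in f.read().split('\r') if ln]

df = pd.DataFrame({'full_line': lines})

# split on the literal '^' and pick out the two wanted fields,
# leaving full_line untouched
fields = df['full_line'].str.split('^')
df['col1'] = fields.str[0]    # first '^'-delimited field
df['col15'] = fields.str[14]  # fifteenth field (0-based index 14)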
Karl Anka
  • It's proving difficult to find a separator that isn't in the text, but I'll carry on looking. – CArnold Feb 09 '17 at 14:29
  • @CArnold If you don't find any separator, you can concatenate all the columns; a bit tedious, but it should work (see the sketch just below). See this: http://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python. I'm not sure if it works, but you can also try removing `low_memory=False` and using a string as the separator, e.g. `sep="c_arnold_pandas"` – Karl Anka Feb 09 '17 at 14:45
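
For reference, the concatenation fallback mentioned in the comment above might look roughly like this (a sketch only; note the OP's concern that rebuilding the line could drop characters, e.g. if pandas reformats a value on the way in):

# read all 26 columns as strings so pandas does not reformat any values,
# and replace missing fields with empty strings so join() works
all_cols = pd.read_csv("tablefile.txt", sep="^", lineterminator="\r",
                       header=None, dtype=str).fillna('')

# re-join the fields with the original delimiter to approximate the raw line
all_cols['full_line'] = all_cols.apply('^'.join, axis=1)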

First, you can compile your regular expressions to avoid re-parsing them for each line:

import re

# compile the patterns once, outside the loop
reCol1id = re.compile(r'^(\d+)\^')
reCol15id = re.compile(r'^.*\^.*\^(\d+)\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*')

count_1 = 0
for line in open('tablefile.txt'):
    if count_1 > 70:
        break
    else:
        # reuse the compiled patterns instead of re-parsing them every iteration
        col1id = reCol1id.findall(line)[0]
        col15id = reCol15id.findall(line)[0]
        line = line.strip()

        count_1 += 1

        cur.execute('''INSERT INTO mytable (mycol1id, mycol15id, wholeline) VALUES (?, ?, ?)''',
                    (col1id, col15id, line))

        conn.commit()
    print('row count_1=', count_1)
Guillaume Jacquenot

I moved the conn.commit() outside the for loop. That reduced the load time to a few minutes, though I'm guessing it's less safe.

Anyway, thanks for the help.
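
For anyone landing here later, a sketch combining both fixes: the compiled patterns from the answer above plus a single commit after the loop. The executemany batching and the database filename are my additions, not from the original code:

import re
import sqlite3

conn = sqlite3.connect('mytable.db')  # hypothetical filename
cur = conn.cursor()

reCol1id = re.compile(r'^(\d+)\^')
reCol15id = re.compile(r'^.*\^.*\^(\d+)\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*')

# newline='' keeps the bare '\r' line terminators intact
with open('tablefile.txt', newline='') as f:
    rows = [(reCol1id.findall(line)[0],
             reCol15id.findall(line)[0],
             line.strip())
            for line in f.read().split('\r') if line]

# one executemany and one commit, instead of a commit per row
cur.executemany(
    'INSERT INTO mytable (mycol1id, mycol15id, wholeline) VALUES (?, ?, ?)',
    rows)
conn.commit()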

CArnold