
I have a lot of csv files formatted as such:

date1::tweet1::location1::language1

date2::tweet2::location2::language2

date3::tweet3::location3::language3

and so on. Some files contain up to 200 000 tweets. I want to extract 4 fields and put them in a pandas dataframe, as well as count the number of tweets. Here's the code I'm using for now:

try:
    data = pd.read_csv(tweets_data_path, sep="::", header=None, engine='python')
    data.columns = ["timestamp", "tweet", "location", "lang"]
    print('Number of tweets: ' + str(len(data)))
except Exception as e:
    print('Error:', str(e))

I get the following error thrown at me

Error: expected 4 fields in line 4581, saw 5

I tried setting error_bad_lines=False, manually deleting the lines that make the program crash, and setting nrows to a lower number, and I still get those "expected fields" errors for seemingly random lines. Say I delete the bottom half of the file: I then get the same error, but for line 1787, which doesn't make sense to me, as that line was processed correctly before. Visually inspecting the csv files doesn't reveal any abnormal pattern that suddenly appears in the offending line either.

The date fields and tweets contain colons, urls and so on, so perhaps a regex separator would make sense?

Can someone help me figure out what I'm doing wrong? Many thanks in advance!

Sample of the data as requested below:

Fri Apr 22 21:41:03 +0000 2016::RT @TalOfer: Barack Obama: Brexit would put UK back of the queue for trade talks [short url] #EuRef #StrongerIn::United Kingdom::en

Fri Apr 22 21:41:07 +0000 2016::RT @JamieRoss7: It must be awful to strongly believe in Brexit and be watching your campaigns make an absolute horse's arse of it.::The United Kingdom::en

Fri Apr 22 21:41:07 +0000 2016::Whether or not it rains on June 23rd will  have more influence on the vote than Obama's lunch with the Queen and LiGA with George. #brexit.::Dublin, Ireland::en

Fri Apr 22 21:41:08 +0000 2016::FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexit vote would send UK to 'back of trade queue' #skypapers [short url]::Mardan, Pakistan::en
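For what it's worth, a line with a stray "::" inside the tweet text reproduces the error; a minimal sketch with made-up rows:

```python
import io
import pandas as pd

# Hypothetical two-line sample: the second tweet contains an extra "::",
# so the parser sees more than 4 fields on that line.
sample = (
    "Fri Apr 22 21:41:03 +0000 2016::RT @TalOfer: Brexit news::United Kingdom::en\n"
    "Fri Apr 22 21:41:07 +0000 2016::a tweet with ::stray:: separators::Dublin, Ireland::en\n"
)

err = None
try:
    pd.read_csv(io.StringIO(sample), sep="::", header=None, engine="python")
except pd.errors.ParserError as e:
    err = e
print("Error:", err)
```

Single colons (in the timestamps or after @mentions) are harmless because the separator is two colons; only a doubled colon inside a field breaks the row.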
  • @user start by removing the "engine". And please include an actual data sample, 5-10 rows. – Merlin Jun 09 '16 at 00:04
  • Hello Merlin and thanks for replying! Removing the engine gives me a "ParserWarning Falling back to python engine because the 'c' engine does not support regex separators". I have edited the OP with actual data – user2763524 Jun 09 '16 at 00:21

2 Answers


Have you tried read_table instead? I got this kind of error when I tried to use read_csv before, and I solved the problem by using read_table. Please refer to this post; it might give you some ideas about how to solve the error. Maybe also try sep=r":{2}" as the delimiter.
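A sketch of the regex-separator suggestion (the sample row is made up; read_csv is shown here, but read_table takes the same arguments):

```python
import io
import pandas as pd

# r":{2}" is a regex matching exactly two colons; single colons inside the
# timestamp or tweet text are left alone. Column names are assumptions
# carried over from the question.
sample = "Fri Apr 22 21:41:03 +0000 2016::RT @TalOfer: Brexit news::United Kingdom::en\n"
data = pd.read_csv(io.StringIO(sample), sep=r":{2}", header=None, engine="python")
data.columns = ["timestamp", "tweet", "location", "lang"]
print(data.shape)
```

Note that regex separators require the python engine, which is why the ParserWarning in the comments appeared when the engine argument was dropped.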

Andreas Hsieh
  • Thank you for your reply! I tried read_table, changing the sep value to what you suggested, as well as the other suggestions in the thread you linked.. still running into the same issue :( – user2763524 Jun 09 '16 at 00:00

Start with this:

pd.read_csv(tweets_data_path, sep="::", header=None, usecols=[0, 1, 2, 3])

The above should bring in 4 columns; then you can figure out how many lines were dropped and whether the data makes sense.
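One hedged way to count dropped lines with a current pandas (on_bad_lines="skip" is the modern spelling of the error_bad_lines=False flag from the question; the sample rows are made up):

```python
import io
import pandas as pd

# Hypothetical content: two clean rows and one with a stray "::" in the tweet.
raw = (
    "d1::t1::loc1::en\n"
    "d2::bad ::extra:: tweet::loc2::en\n"
    "d3::t3::loc3::en\n"
)

# Skip malformed rows, then compare row counts to see how many were dropped.
data = pd.read_csv(io.StringIO(raw), sep="::", header=None,
                   engine="python", on_bad_lines="skip")
total = raw.count("\n")
print(f"parsed {len(data)} of {total} lines")
```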

Use this pattern:

data["lang"].unique()

Since you have a problem with the data and do not know where it is, you need to step back and use Python's csv reader. This should get you started.

import csv

tweetList = []
with open(tweets_data_path) as f:
    reader = csv.reader(f)
    for row in reader:
        try:
            # csv.reader splits on commas, so row[0] is the line up to the
            # first comma; split that chunk on the real "::" delimiter
            tweetList.append(row[0].split('::'))
        except Exception as e:
            print('Error:', str(e))

print(tweetList)

tweetsDf = pd.DataFrame(tweetList)



print(tweetsDf)
                                   0  \
    0   Fri Apr 22 21:41:03 +0000 2016   
    1   Fri Apr 22 21:41:07 +0000 2016   
    2   Fri Apr 22 21:41:07 +0000 2016   
    3   Fri Apr 22 21:41:08 +0000 2016   

                                                       1                   2     3  
0  RT @TalOfer: Barack Obama: Brexit would put UK...      United Kingdom    en  
1  RT @JamieRoss7: It must be awful to strongly b...  The United Kingdom    en  
2  Whether or not it rains on June 23rd will  hav...              Dublin  None  
3  FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexi...              Mardan  None  
Merlin
  • Just tried this, same error.. thanks for your help so far, much appreciated! – user2763524 Jun 09 '16 at 00:26
  • Hello, I was able to fix the problem by specifying index_col=[0,1,2,3], usecols=[0,1,2,3] in read_csv. Using one or the other did not work as a standalone. Many thanks for your help, this was a bit of an obscure bug to figure out :) – user2763524 Jun 11 '16 at 15:09
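The fix reported in that last comment can be sketched as follows (the sample row is made up and the column names are assumptions from the question; whether malformed rows are also absorbed this way may depend on the pandas version):

```python
import io
import pandas as pd

# Passing the same positions to both index_col and usecols reads the four
# fields into the index; reset_index then turns them back into columns.
sample = "Fri Apr 22 21:41:03 +0000 2016::some tweet::United Kingdom::en\n"
data = pd.read_csv(io.StringIO(sample), sep="::", header=None,
                   engine="python", index_col=[0, 1, 2, 3], usecols=[0, 1, 2, 3])
data = data.reset_index()
data.columns = ["timestamp", "tweet", "location", "lang"]
print("Number of tweets:", len(data))
```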