I have a file whose lines look like these examples:

(22642441022L, u'<a href="http://example.com">Click</a>', u'fox, dog, cat are examples http://example.com')
(1153634043, u'<a href="http://example.com">Click</a>', u"I learned so much from my mistakes, I think I'm gonna make some more")

I'm trying to parse it to a list of objects with this code:

import csv

file_path = 'Data/example.txt'
data = []

with open(file_path, 'r') as f:
    reader = csv.reader(f, skipinitialspace=True)
    for row in reader:
        data.append({'id' : row[0], 'source' : row[1], 'content' : row[2]})

As expected, the content is truncated due to the ',' in the content column. Is there any package that can help me parse this out of the box?

Mokhtar Ashour
  • Does the file actually contain `(` and `u'` ? – Anton vBR Dec 28 '17 at 17:26
  • Does the file look _exactly_ like this? – cs95 Dec 28 '17 at 17:26
  • Yes, unfortunately. I don't know which language was used to write such a file, but it's a dataset I need to load – Mokhtar Ashour Dec 28 '17 at 17:28
  • You can't parse this code with python3. Your numbers have the Long `L` suffix at the end. My guess is someone foolishly `str`d a list of tuples into a file using python2. Please kick them. – cs95 Dec 28 '17 at 17:31
  • That looks like a pure python `print` of a list of tuples... Eval comes to mind, although is probably not such a great idea (https://stackoverflow.com/questions/1832940/why-is-using-eval-a-bad-practice) – Savir Dec 28 '17 at 17:35
  • `[x.strip("""()"'""") for x in line.split(', u')]` but I don't know what to do with `L` if it's a problem – splash58 Dec 28 '17 at 17:35
  • @cᴏʟᴅsᴘᴇᴇᴅ I wish I could, but it's a dataset available as is. – Mokhtar Ashour Dec 28 '17 at 17:36
  • I don't really care about the numbers here (I can deal with them as strings), my problem is parsing columns correctly as strings – Mokhtar Ashour Dec 28 '17 at 17:42

2 Answers

Looking at your data, someone has dumped the str version of a list into a file as-is, using python2.

One thing's for sure - you can't use a CSV reader for this data. You can't even use a JSON parser (which would've been the next best thing).
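A quick check makes both failures concrete. This is only a sketch that feeds the first sample line from the question to the two parsers; the variable names (`line`, `row`, `json_ok`) are illustrative and not from the original post.

```python
import csv
import io
import json

# First sample line from the question, held as plain text.
line = ("(22642441022L, u'<a href=\"http://example.com\">Click</a>', "
        "u'fox, dog, cat are examples http://example.com')")

# With its default dialect, csv treats only a leading double quote as field
# quoting, so every comma splits, including those inside the content string.
row = next(csv.reader(io.StringIO(line), skipinitialspace=True))
print(len(row))   # more fields than the 3 columns we want

# json rejects the line outright: parentheses, the u prefix, the L suffix
# and single-quoted strings are not valid JSON.
try:
    json.loads(line)
    json_ok = True
except ValueError:   # json.JSONDecodeError is a subclass of ValueError
    json_ok = False
print(json_ok)
```

The csv reader returns too many fields and the JSON parser raises, which is why a parser that understands Python literals is the better fit here.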

What you can do is use ast.literal_eval. With python2, this works out of the box.

import ast

data = []
with open('file.txt') as f:
    for line in f:
        try:
            data.append(ast.literal_eval(line))
        except (SyntaxError, ValueError):
            pass

data should look something like this -

[(22642441022L,
  '<a href="http://example.com">Click</a>',
  'fox, dog, cat are examples http://example.com'),
 (1153634043,
  '<a href="http://example.com">Click</a>',
  "I learned so much from my mistakes, I think I'm gonna make some more")]

You can then pass data into a DataFrame as-is -

import pandas as pd

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

             A                                       B  \
0  22642441022  <a href="http://example.com">Click</a>   
1   1153634043  <a href="http://example.com">Click</a>   

                                                   C  
0      fox, dog, cat are examples http://example.com  
1  I learned so much from my mistakes, I think I'...  
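If you would rather have the list of dicts that the question was building, the parsed tuple can be zipped with the field names. A minimal sketch, assuming the three-column layout and the key names from the question's code; `FIELDS` and `record` are names made up for this example. It uses the second sample line, which has no L suffix and therefore parses on python2 and on python 3.3+ alike.

```python
import ast

# Second sample line from the question (no L suffix in this one).
line = ("(1153634043, u'<a href=\"http://example.com\">Click</a>', "
        "u\"I learned so much from my mistakes, I think I'm gonna make some more\")")

FIELDS = ('id', 'source', 'content')   # key names taken from the question's code

row = ast.literal_eval(line)           # a 3-tuple: (int, str, str)
record = dict(zip(FIELDS, row))        # pair each key with its column
print(record['content'])
```

Unlike eval, ast.literal_eval only accepts literals (numbers, strings, tuples, None and so on), so a line containing a function call raises ValueError instead of being executed.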

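If you want this to work with python3, you'll need to get rid of the long suffix L, which python3 no longer accepts. The unicode prefix u is legal again from python 3.3 onwards, so stripping it is only needed on 3.0-3.2. You can do both using re.sub from the re module.

import ast
import re

data = []
with open('file.txt') as f:
    for line in f:
        try:
            i = re.sub(r'\b(\d+)L\b', r'\1', line)      # remove L suffix
            j = re.sub(r'(?<=,\s)u(?=[\'"])', '', i)    # remove u prefix
            data.append(ast.literal_eval(j))
        except (SyntaxError, ValueError):
            pass

Notice the added re.sub(r'\b(\d+)L\b', r'\1', line), which removes the L suffix at the end of a string of digits. Both substitutions run on the raw text, so they would also alter matching text inside a quoted string.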
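To check this against the longer line quoted in the comments below, here is a small sketch for python 3.3+. It strips only the L suffix, since the u prefix is accepted there. The helper name `parse_py2_tuple` and the pattern name `LONG_SUFFIX` are hypothetical, chosen just for this example.

```python
import ast
import re

# Digits followed by a python2 long suffix, e.g. 22642441022L.
LONG_SUFFIX = re.compile(r'\b(\d+)L\b')

def parse_py2_tuple(line):
    """Parse one str()-dumped python2 tuple on python 3.3+ (hypothetical helper)."""
    # Caveat: this also rewrites text such as "10L" inside a quoted string.
    return ast.literal_eval(LONG_SUFFIX.sub(r'\1', line))

# Longer line quoted in the comments on this answer.
sample = ("(22642586115L, 7248952, 1283282654000L, 0, -1, -1, None, -999999.0, "
          "-999999.0, u'Ping.fm', 0, 0, "
          "u'CPPRI Recruitment 2010 at http://example.com/', -1, u'', u'')")

row = parse_py2_tuple(sample)
print(len(row), row[0], row[6], row[9])
```

Ints, floats, None and the empty strings all come back as proper Python values, so nothing has to be handled column by column.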

cs95
  • Thanks for your answer, but it didn't work. It throws a syntax error on every line. I'm using python 3.6 BTW. – Mokhtar Ashour Dec 28 '17 at 18:05
  • @MokhtarAshour If you could show me one of the lines that are erroring out, that would help... because, it works for the data you posted ;-( – cs95 Dec 28 '17 at 18:06
  • (22642586115L, 7248952, 1283282654000L, 0, -1, -1, None, -999999.0, -999999.0, u'Ping.fm', 0, 0, u'CPPRI Recruitment 2010 at http://example.com/', -1, u'', u'') – Mokhtar Ashour Dec 28 '17 at 18:08
  • @MokhtarAshour Weird that `ast` sometimes parses unicode, and sometimes not. This would've worked on python2, but I'll need regex to get rid of the unicodes. Give me a few minutes. – cs95 Dec 28 '17 at 18:14
  • @MokhtarAshour Edits made, try it now and let me know. – cs95 Dec 28 '17 at 18:16

So it looks like the file was generated by doing something like this (a pure dump of a Python str() or print):

data_list = [
    (22642441022L, u'<a href="http://example.com">Click</a>', u'fox, dog, cat are examples http://example.com'),
    (1153634043, u'<a href="http://example.com">Click</a>', u"I learned so much from my mistakes, I think I'm gonna make some more")
]  # List of tuples

with open('./stack_084.txt', 'w') as f:
    f.write('\n'.join([str(data) for data in data_list]))

Regular expressions come to mind (assuming that the values in your second "column" always start with <a and end with a>):

import pprint
import re

line_re = re.compile(
    r'\('
    r'(?P<num>\d+)L?'                    # id, with an optional long suffix
    r'.+?'
    r'[\'"](?P<source><a.+?a>)[\'"]'     # second column: <a ...>...</a>
    r'.+?'
    r'[\'"](?P<content>.+?)[\'"]'        # third column
    r'\)'
)

data = []
with open('./stack_084.txt', 'r') as f:
    for line in f:
        match = line_re.match(line)
        if match:
            data.append({
                'id': int(match.group('num')),
                'source': match.group('source'),
                'content': match.group('content')
            })

# You should see parsed data here:
pprint.pprint(data)

This outputs:

[{'content': 'fox, dog, cat are examples http://example.com',
  'id': 22642441022,
  'source': '<a href="http://example.com">Click</a>'},
 {'content': "I learned so much from my mistakes, I think I'm gonna make some "
             'more',
  'id': 1153634043,
  'source': '<a href="http://example.com">Click</a>'}]
Savir
  • I see you use a regex to handle it, but the actual lines in the file are longer (I included a subset). This is one line of the real dataset: (22642586115L, 7248952, 1283282654000L, 0, -1, -1, None, -999999.0, -999999.0, u'Ping.fm', 0, 0, u'CPPRI Recruitment 2010 at http://example.com/', -1, u'', u'') This way I would need to carefully write a regex that handles the whole line – Mokhtar Ashour Dec 28 '17 at 18:12
  • Numbers are easy... Just more of the `(\d+)L{0,1}` groups... And those `None`... Those are suspicious. I imagine you have rows where the 7th value is not `None`? (I don't imagine that whatever generated your data is gonna have **ALL** `None`(s), right?) – Savir Dec 28 '17 at 18:21
  • Yes, some rows contain something like u'MyComfyCat' – Mokhtar Ashour Dec 28 '17 at 18:30
  • I have accepted the first answer, but I still like your answer, so I'm upvoting – Mokhtar Ashour Dec 28 '17 at 18:31
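For the longer rows discussed in these comments, one way to avoid writing a single regex for the whole line is to match one field at a time. This is a rough sketch, not the answer's original code: `FIELD_RE` and `split_fields` are hypothetical names, and it assumes the rows contain only ints, floats, None and quoted strings, as in the sample shown.

```python
import re

# One alternative per kind of value seen in the sample rows (an assumption:
# real data may contain other literal types that this does not cover).
FIELD_RE = re.compile(r"""
      u?'(?P<sq>(?:[^'\\]|\\.)*)'      # single-quoted string, optional u prefix
    | u?"(?P<dq>(?:[^"\\]|\\.)*)"      # double-quoted string, optional u prefix
    | (?P<none>None)
    | (?P<num>-?\d+(?:\.\d+)?)L?       # int or float, optional long suffix
""", re.VERBOSE)

def split_fields(line):
    """Return the values found in one dumped tuple (hypothetical helper)."""
    values = []
    for m in FIELD_RE.finditer(line):
        kind = m.lastgroup            # name of the alternative that matched
        text = m.group(kind)
        if kind == 'none':
            values.append(None)
        elif kind == 'num':
            values.append(float(text) if '.' in text else int(text))
        else:
            # Backslash escapes inside strings are left undecoded here.
            values.append(text)
    return values

sample = ("(22642586115L, 7248952, 1283282654000L, 0, -1, -1, None, -999999.0, "
          "-999999.0, u'Ping.fm', 0, 0, "
          "u'CPPRI Recruitment 2010 at http://example.com/', -1, u'', u'')")
values = split_fields(sample)
print(len(values), values[9], values[12])
```

Because quoted strings are consumed whole, commas and digits inside them are never mistaken for separate fields. Escape sequences inside the strings are not decoded, though, so ast.literal_eval remains the simpler option when it applies.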