
Background

I'm looking to convert a text file of ~1.1m lists into JSON, and then into a pandas dataframe. The file is currently set up such that each list is separated by a newline only, and structured in the following manner:

['Here is a string!', 'London, England', [[-2.68, 50.92], [-2.68, 50.96], [-2.61, 50.96], [-2.61, 50.92]], 'FakeUserName', 1234567, [('581294', 'Other_user')]]

Problem

I'd like to convert each list into JSON and subsequently write to a new file, which I can then use in a separate call to pd.read_json. I am having difficulty owing to the variable length of the mentions element (no limit on the number of mentions tuples). Ideally the resulting dataframe would have the following columns:

+-----+--------------------+-----------------------+----------------+------------+---------+--------------------------+
|     |       String       |          LOC          |       BB       |   User     |   ID    |         Mentions         |
+-----+--------------------+-----------------------+----------------+------------+---------+--------------------------+
|   0 | "Here is a string" | ('London', 'England') | [[-2.68..],..] | 'FakeUser' | 1234567 | [(581294, 'other_user')] |
|   1 |                    |                       |                |            |         |                          |
| ... |                    |                       |                |            |         |                          |
+-----+--------------------+-----------------------+----------------+------------+---------+--------------------------+

Work Done So Far

  • Processing each line with ast.literal_eval(line) to allow indexing.
  • Attempted to convert each line using json.dumps(line) and then pass the result to a dataframe. This converts the list into a JSON array, which results in a less-than-ideal interpretation of what each column should be when it is then passed to pd.read_json.
  • Unsuccessful use of json_normalize as described in How to flatten a pandas dataframe with some columns as json?.
  • Formatting each column manually: df = pd.DataFrame({"String": list[0], "LOC":list[1]... })
  • Creation of custom class (similar to: https://stackoverflow.com/a/44195896/7322036)

Any suggestions for things I've missed? This is proving to be a lot more difficult than I had initially assumed.

EDIT

Added the example list into the table to demonstrate what I'm attempting to do.

A. Prague
  • You should show one record (what you have done) and how it should go into the dataframe, and another record exhibiting the mentions number problem, along with the way it goes into the dataframe. I may be tired after my work day but I cannot guess what you really want... – Serge Ballesta Feb 11 '20 at 18:31
  • @SergeBallesta Does the edit help? I'm not sure how I could explain it further without repeating myself. As for the mentions number problem - in the last column of the dataframe, the list of tuples can be of any length - there is a variable amount of tuples within that list, for a given list in the overarching file. – A. Prague Feb 11 '20 at 18:43

1 Answer

If I have understood your problem correctly, going through JSON only adds complexity.

The DataFrame constructor should be enough:

import ast
import pandas as pd

with open('file.txt') as fd:
    df = pd.DataFrame(columns=['String', 'LOC', 'BB', 'User', 'ID', 'Mentions'],
                      data=[ast.literal_eval(line) for line in fd])

Repeating your sample 4 times, I got:

              String              LOC                                                 BB          User       ID                Mentions
0  Here is a string!  London, England  [[-2.68, 50.92], [-2.68, 50.96], [-2.61, 50.96...  FakeUserName  1234567  [(581294, Other_user)]
1  Here is a string!  London, England  [[-2.68, 50.92], [-2.68, 50.96], [-2.61, 50.96...  FakeUserName  1234567  [(581294, Other_user)]
2  Here is a string!  London, England  [[-2.68, 50.92], [-2.68, 50.96], [-2.61, 50.96...  FakeUserName  1234567  [(581294, Other_user)]
3  Here is a string!  London, England  [[-2.68, 50.92], [-2.68, 50.96], [-2.61, 50.96...  FakeUserName  1234567  [(581294, Other_user)]
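
If an intermediate JSON file is still wanted, one possible route is to key each parsed list by column name and write one JSON object per line (JSON Lines). The following is only a minimal sketch under assumptions: the names COLUMNS, line_to_record and convert are hypothetical helpers made up for illustration, the column names come from the table in the question, and in-memory streams stand in for the real input and output files.

```python
import ast
import io
import json

# Column names taken from the desired table in the question.
COLUMNS = ["String", "LOC", "BB", "User", "ID", "Mentions"]


def line_to_record(line):
    """Parse one Python-literal list and key its items by column name."""
    values = ast.literal_eval(line)
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


def convert(src, dst):
    """Write one JSON object per non-blank input line; return the count."""
    count = 0
    for line in src:
        if not line.strip():
            continue  # skip blank lines between records
        dst.write(json.dumps(line_to_record(line)) + "\n")
        count += 1
    return count


# The sample record from the question, used twice as stand-in input.
sample = ("['Here is a string!', 'London, England', "
          "[[-2.68, 50.92], [-2.68, 50.96], [-2.61, 50.96], [-2.61, 50.92]], "
          "'FakeUserName', 1234567, [('581294', 'Other_user')]]\n")

out = io.StringIO()
n_written = convert(io.StringIO(sample * 2), out)
first = json.loads(out.getvalue().splitlines()[0])
```

With real files in place of the StringIO objects, the output could then be loaded with pd.read_json(path, lines=True). Note that json.dumps turns the mention tuples into lists, so the Mentions column holds lists of lists after the round trip, whatever the number of mentions.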
Serge Ballesta