
I have a huge text file (4 GB), where each line has the syntax: [number] [number]_[number] [Text].
For example:

123 12_14 Text 1
1234 13_456 Text 2
33 12_12 Text 3
24 678_10 Text 4

My goal is to save this data as an Excel file, where each line of the text file becomes a row in the Excel file. For the example above:

[A1] 123
[B1] 12_14
[C1] Text 1
[A2] 1234
[B2] 13_456
[C2] Text 2
[A3] 33
[B3] 12_12
[C3] Text 3
[A4] 24
[B4] 678_10
[C4] Text 4

My plan is to iterate over the text lines, as advised here, split each line into its parts,
and save the parts to cells in an Excel file.
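For reference, this is roughly the splitting step I have in mind, as a minimal sketch (the file name is just a placeholder, and it assumes every line really has the three fields shown above):

with open("huge_file.txt", encoding="utf-8") as f:   # placeholder file name
    for line in f:                                   # reads one line at a time, not the whole 4 GB
        parts = line.rstrip("\n").split(" ", 2)      # [number], [number]_[number], [Text]
        if len(parts) != 3:
            continue                                 # skip malformed lines
        post_id, pair, text = parts
        # post_id goes to column A, pair to column B, text to column C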

Because of the file's size, I thought of creating many small Excel files which together would hold all the data from the text file.

Then I need to analyze the small Excel files, mainly to find terms that were mentioned in the [Text] cells and count the number of appearances, related to the [number] cells (representing a post and the ID of a post).

Finally, I need to summarize all this data in one Excel file.

I'm considering the best way to create and analyze the Excel files.
As mentioned here, the main libraries are xlrd and csv.

Presen
  • If your only goal is to analyze the original text file (things like counting stuff...), then this is very much **not** the easiest way to do it. If you have some other reason for ending up with a bunch of Excel files, then carry on. – Brionius Aug 14 '13 at 12:32
  • You don't need to load the whole text file into memory, you can just [read the text line by line](http://www.yak.net/fqa/171.html). Splitting the text up into Excel files is completely unnecessary. – Daniel Aug 14 '13 at 12:36
  • @Brionius I'm pretty sure I don't have any option other than small Excel files, but what would another approach be? – Presen Aug 14 '13 at 12:46
  • @simpleBob The code suggested (http://stackoverflow.com/questions/18178089/parsing-big-text-file-using-regex/18178159#18178159) doesn't load the whole text into memory. – Presen Aug 14 '13 at 12:49
  • @pnuts It was a beginner's mistake, sorry. – Presen Aug 14 '13 at 12:50
  • @user1869297 - it's hard to say what another approach would be unless I know what your actual goal is - if you can clarify what the end-goal of this whole business is, we can help you find the best way. – Brionius Aug 14 '13 at 12:55
  • read a file line by line in excel: http://stackoverflow.com/questions/11528694/read-parse-text-file-line-by-line-in-vba/11528932#11528932 – SeanC Aug 14 '13 at 13:34

2 Answers

3

"I'm pretty sure I don't have other options than small excel files, but what will be the another approach?"

Your huge text file is a type of database, although an inconvenient one. A bunch of small Excel files are another, even less convenient, representation of the same database. I assume you are looking to make a bunch of small files because Excel has an upper limit on how many rows a sheet can contain (65,536 or 1,048,576, depending on the version of Excel). However, as has been noted, Excel files are truly horrible database stores.

Since you are already using Python, use the sqlite3 module: it's already built in, it's a real database, it can handle far more than a million rows, and it's fast.
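As a rough illustration (not tested against your data), loading the file into SQLite and then counting term appearances per post ID could look something like this; the file names, table name, and search term are placeholders, and it assumes the three-field line format from your question:

import sqlite3

conn = sqlite3.connect("posts.db")                       # placeholder database file
conn.execute("create table if not exists posts "
             "(post_id text, pair text, body text)")

with open("huge_file.txt", encoding="utf-8") as f:       # placeholder input file
    rows = (line.rstrip("\n").split(" ", 2) for line in f)
    conn.executemany("insert into posts values (?, ?, ?)",
                     (r for r in rows if len(r) == 3))   # skip malformed lines
conn.commit()

# count how many rows mention a given term, grouped by post ID
term = "%some term%"                                     # placeholder search term
query = "select post_id, count(*) from posts where body like ? group by post_id"
for post_id, n in conn.execute(query, (term,)):
    print(post_id, n)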

I wanted to get an idea of how fast it is with data on the scale you propose, so I created a 30-million-row database of roughly the same complexity as your dataset. The schema is simple:

create table words
    (id integer primary key autoincrement not null,
     first text, second text, third text);

and populated it with random trigrams drawn from /usr/dict/words (I have a module for generating test data like this), which makes entries that look like:

sqlite> select * from words limit 5;
1|salvation|prorates|bird
2|fore|embellishment|empathized
3|scalier|trinity|graze
4|exes|archways|interrelationships
5|voguish|collating|partying

But a simple query for a row I knew was near the end took longer than I'd hoped:

select * from words where first == "dole" and second == "licked";
29599672|dole|licked|hates

That took about 7 seconds on a pretty average 3-year-old desktop, so I added a couple of indexes:

create index first on words (first);
create index second on words (second);

That did double the size of the database file, from 1.1 GB to 2.3 GB, but it brought the simple query time down to a rather reasonable 0.006 seconds. I don't think you'll do as well with Excel.

So parse your data however you must, but then put it in a real database.

msw
0

What is the issue with just looping through the file line by line? If you have your heart set on Excel, I would recommend openpyxl.
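For example, a minimal sketch using openpyxl's write-only mode, which streams rows to disk instead of holding the whole workbook in memory (the file names are placeholders, and it assumes the three-field format from the question):

from openpyxl import Workbook

wb = Workbook(write_only=True)        # write-only mode streams rows instead of keeping them in memory
ws = wb.create_sheet()

with open("huge_file.txt", encoding="utf-8") as f:   # placeholder input file
    for line in f:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) == 3:
            ws.append(parts)          # one text line becomes one Excel row (columns A, B, C)

wb.save("output.xlsx")                # note: a single sheet still tops out at 1,048,576 rows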

Martyn