
I have 60TB of data that resides in 12 csv files.

The data will be loaded into a clustered database where the loading process is single-threaded. To improve load performance, I need to initiate a load process from each node.

So far so good from this point of view. My biggest problem is how to split this data: it is zipped, and each csv file holds around 5TB of data! I tried split, but it takes too long!

squillman
Up_One

3 Answers


The easiest, though most likely not the fastest, way is:

unzip -p <zipfile> | split -C <size>
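For example, here is the same pipeline shape demonstrated on small generated data; with the real archives you would replace the `cat` with `unzip -p <zipfile>` and use a much larger chunk size (the file names and sizes below are illustrative assumptions):

```shell
# Demonstrate the pipe shape on generated data; with the real archives,
# replace "cat sample.csv" with "unzip -p <zipfile>" and raise the size.
seq 1 1000 | awk '{print "id_" $1 ",value"}' > sample.csv
cat sample.csv | split -C 4k -d - part_   # -C: split at line boundaries, <= 4KB per chunk
wc -l part_*                              # every chunk ends on a complete csv line
```

Because `-C` never splits mid-line, each chunk can be loaded independently, and concatenating the chunks reproduces the original file.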
Nik

Assuming the order of the data is unimportant, one way to do this (not faster per se, but at least somewhat parallel) would be to write a script that does the following:

  1. Open the zip file.
  2. Get the first file inside it.
  3. Read the data out of that file, line by line.
  4. Write each csv line into one of several output zip files.
  5. Rotate through those output files (say, five zip files), one line at a time.
  6. Once they reach a certain size (say 50GB), start a brand new set of zip files.

This isn't any faster than a sequential read of the big file, but it lets you split the file into smaller chunks that can be loaded in parallel while the remaining data is still being processed.

Like most compressed output, it's not seekable (you cannot jump X bytes ahead), so the biggest downside is that if the process aborts for some reason, you are forced to restart the whole thing from scratch.

Python provides support for doing something like this via the zipfile module.
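A minimal sketch of the steps above using the zipfile module; the file names, the round-robin count, and the size threshold are illustrative assumptions, and in production you would add error handling and resume markers:

```python
import io
import zipfile

def split_zip_csv(src_zip, n_outputs=3, max_bytes=50_000_000, prefix="chunk"):
    """Round-robin lines from the first CSV inside src_zip across n_outputs
    zip files, starting a fresh set once every file reaches max_bytes."""
    def open_generation(gen):
        # One output zip per slot, each with a single CSV member open for writing.
        files = []
        for i in range(n_outputs):
            zf = zipfile.ZipFile(f"{prefix}_{gen}_{i}.zip", "w", zipfile.ZIP_DEFLATED)
            files.append((zf, zf.open(f"part_{gen}_{i}.csv", "w")))
        return files

    def close_all(files):
        for zf, handle in files:
            handle.close()
            zf.close()

    generation = 0
    written = [0] * n_outputs
    out = open_generation(generation)
    with zipfile.ZipFile(src_zip) as src:
        first_member = src.namelist()[0]          # step 2: the first file in the zip
        with src.open(first_member) as raw:
            for idx, line in enumerate(io.TextIOWrapper(raw, encoding="utf-8")):
                slot = idx % n_outputs            # steps 4-5: rotate per line
                data = line.encode("utf-8")
                out[slot][1].write(data)
                written[slot] += len(data)
                if all(w >= max_bytes for w in written):
                    close_all(out)                # step 6: roll to a new set
                    generation += 1
                    written = [0] * n_outputs
                    out = open_generation(generation)
    close_all(out)
```

Each output zip holds exactly one member and is closed as soon as its set fills up, so finished chunks can be shipped to the loader nodes while the script is still working through the source file.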

Matthew Ife

Do you have to load the 12 files in order or can they be imported in parallel?

I ask because if they have to be loaded in order, splitting them further won't let you run anything in parallel anyway; and if they don't, you can simply import the 12 files you already have in parallel.

If the files aren't already available on the nodes, transferring them there may take as long as the import anyway.

Bottlenecks can show up in surprising places. Have you started the single-thread import process and verified that the nodes are underutilised? You may be solving the wrong problem if you haven't checked.

Ladadadada