88

I have files with invalid characters in their names, like this one:

009_-_�%86ndringshåndtering.html

It should be an Æ, but something has gone wrong in the filename.

Is there a way to just remove all invalid characters?

Or could tr be used somehow?

echo "009_-_�%86ndringshåndtering.html" | tr ???
Sandra

11 Answers

75

I had some Japanese files with broken filenames, recovered from a broken USB stick, and the solutions above didn't work for me.

I recommend the detox package:

The detox utility renames files to make them easier to work with. It removes spaces and other such annoyances. It'll also translate or cleanup Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI escaped characters.

Example usage:

detox -r -v /path/to/your/files
-r Recurse into subdirectories
-v Be verbose about which files are being renamed 
-n Can be used for a dry run (only show what would be changed)
H. Hess
63

One way would be with sed:

mv 'file' "$(echo 'file' | sed -e 's/[^A-Za-z0-9._-]/_/g')"

Replace file with your filename, of course. This will replace anything that isn't a letter, number, period, underscore, or dash with an underscore. You can add or remove characters to keep as you like, and/or change the replacement character to anything else, or nothing at all.
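Since the question asks about tr: the same whitelist idea can be sketched with tr instead of sed. This is only a minimal sketch; clean_name is a made-up helper name, and forcing LC_ALL=C (so that tr works on plain bytes) is my assumption about what you want. A multibyte UTF-8 character then becomes one underscore per byte.

```shell
# Hypothetical helper: print a cleaned copy of the name given as $1.
clean_name() {
  # -c complements the set, so every byte NOT listed becomes "_".
  # LC_ALL=C makes tr treat the input as plain bytes.
  printf '%s' "$1" | LC_ALL=C tr -c 'A-Za-z0-9._-' '_'
}

clean_name 'my file?.html'; echo
# prints: my_file_.html
```

Use tr -cd with the same set (and no replacement character) if you would rather delete the unwanted bytes than replace them.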

49

I assume you are on a Linux box and the files were created on a Windows box. Linux systems typically use UTF-8 as the character encoding for filenames, while Windows uses something else. I think this is the cause of the problem.

I would use "convmv". This is a tool that can convert filenames from one character encoding to another. For Western Europe one of these normally works:

convmv -r -f windows-1252 -t UTF-8 .
convmv -r -f ISO-8859-1 -t UTF-8 .
convmv -r -f cp-850 -t UTF-8 .

If you need to install it on a Debian based Linux you can do so by running:

sudo apt-get install convmv

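It works for me every time, and it does recover the original filename. Note that by default convmv only prints what it would do; add --notest to actually rename the files.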

Source: LeaseWebLabs
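If you are not sure which source encoding applies, it can help to preview how the raw bytes read under each candidate before renaming anything. The sketch below uses iconv for that, which is my own substitution and not part of convmv; preview_encodings is a hypothetical helper, and the encoding names are the ones glibc's iconv accepts, so they may need adjusting elsewhere.

```shell
# Hypothetical helper: show how the raw bytes of a name read under
# several candidate source encodings (the list is an assumption).
preview_encodings() {
  raw=$1
  for enc in WINDOWS-1252 ISO-8859-1 CP850; do
    # Skip encodings that this iconv build does not know or cannot apply.
    converted=$(printf '%s' "$raw" | iconv -f "$enc" -t UTF-8 2>/dev/null) || continue
    printf '%s: %s\n' "$enc" "$converted"
  done
}

# 0xC6 (octal 306) is "Æ" in ISO-8859-1 and Windows-1252.
preview_encodings "$(printf '\306ndringer.html')"
```

Whichever line looks right tells you which value to pass to convmv -f.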

mevdschee
23

I assume you mean you want to traverse the filesystem and fix all such files?

Here's the way I'd do it

find /path/to/files -type f -print0 | \
perl -n0e 'chomp; $new = $_; if($new =~ s/[^[:ascii:]]/_/g) {
  print("Renaming $_ to $new\n"); rename($_, $new);
}'

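That would find all files with non-ASCII characters and replace those characters with underscores (_). Use caution, though: if a file with the new name already exists, it will be overwritten. The script can be modified to check for such a case, but I didn't put that in, to keep it simple. Also note that the substitution is applied to the whole path, so it only works as intended when the parent directory names are already ASCII.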
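The overwrite check mentioned above could look roughly like this in plain shell. It is a sketch, not part of the Perl script: safe_rename is a hypothetical helper, and the check is not atomic, so another process could still create the target between the test and the mv.

```shell
# Hypothetical helper: rename $1 to $2, but never overwrite an existing target.
safe_rename() {
  old=$1
  new=$2
  # Nothing to do when the name is unchanged.
  [ "$old" = "$new" ] && return 0
  # -L also catches dangling symlinks, which -e reports as missing.
  if [ -e "$new" ] || [ -L "$new" ]; then
    printf 'skip: %s already exists\n' "$new" >&2
    return 1
  fi
  mv -- "$old" "$new"
}
```

Both GNU and BSD mv also have a -n (no-clobber) option that gives a similar guarantee in a single step.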

phemmer
18

Following the answers at https://stackoverflow.com/questions/2124010/grep-regex-to-match-non-ascii-characters, you can use:

rename 's/[^\x00-\x7F]//g' *

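where * matches the files you want to rename. If you want to do it over multiple directories, you can do something like the following (-depth handles a directory's contents before the directory itself, and -execdir runs rename inside each file's own directory, so only the last path component is changed):

find . -depth -execdir rename 's/[^\x00-\x7F]//g' "{}" \;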

You can use the -n argument to rename to do a dry run, and see what would be changed, without changing it.

naught101
7

This shell script sanitizes a directory recursively, to make files portable between Linux/Windows and FAT/NTFS/exFAT. It removes control characters, /:*?"<>\| and some reserved Windows names like COM0.

sanitize() {
  shopt -s extglob;

  filename=$(basename "$1")
  directory=$(dirname "$1")

  filename_clean=$(echo "$filename" | sed -e 's/[\\/:\*\?"<>\|\x01-\x1F\x7F]//g' -e 's/^\(nul\|prn\|con\|lpt[0-9]\|com[0-9]\|aux\)\(\.\|$\)//i' -e 's/^\.*$//' -e 's/^$/NONAME/')

  if [ "$filename" != "$filename_clean" ]
  then
    mv -v "$1" "$directory/$filename_clean"
  fi
}

export -f sanitize

sanitize_dir() {
  find "$1" -depth -exec bash -c 'sanitize "$0"' {} \;
}

sanitize_dir '/path/to/somewhere'

Linux is less restrictive in theory (only / and \0 are strictly forbidden in filenames), but in practice several characters interfere with bash commands (like *), so they should also be avoided in filenames.
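If you only want to detect the reserved Windows device names without renaming anything, a check along these lines may be enough. It is a sketch that mirrors the name list used in the sed expression above; is_reserved_name is a hypothetical helper, not something the script defines.

```shell
# Hypothetical helper: succeed if $1 is a reserved Windows device name,
# with or without an extension (e.g. "NUL", "com1.txt").
is_reserved_name() {
  # Drop everything from the first dot, then lowercase for comparison.
  base=$(printf '%s' "${1%%.*}" | tr 'A-Z' 'a-z')
  case $base in
    nul|prn|con|aux|lpt[0-9]|com[0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

is_reserved_name 'NUL.txt' && echo 'NUL.txt is reserved'
# prints: NUL.txt is reserved
```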

Great sources for file naming restrictions:

3

I use this one-liner to remove invalid characters in subtitle files:

for f in *.srt; do nf=$(echo "$f" |sed -e 's/[^A-Za-z0-9.-]/./g;s/\.\.\././g;s/\.\././g'); test "$f" != "$nf" && mv "$f" "$nf" && echo "$nf"; done
  1. Only processes *.srt files (* could be used in place of *.srt to process every file)
  2. Replaces every character except letters A-Za-z, digits 0-9, periods "." and dashes "-" with a period
  3. Collapses possible double or triple periods into a single period
  4. Checks whether the file name needs changing
  5. If so, renames the file with the mv command, then prints the new name with the echo command

A variant works to normalize directory names of movies:

for f in */; do nf=$(echo "$f" |sed -e 's/[^A-Za-z0-9.-]/./g' -e 's/\.\.\././g' -e 's/\.\././g' -e 's/\.*$//'); test "$f" != "$nf" && mv "$f" "$nf" && echo "$nf"; done

Same steps as above, but I added one more sed command to remove any periods at the end of the directory name. For example:

X-Men Days of Future Past (2014) [1080p]
Modified to:
X-Men.Days.of.Future.Past.2014.1080p
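To try the name transformation on its own before letting it loose on real files, it can be wrapped in a function that only prints the result. This is a sketch: dotify is a made-up name, and I collapse runs of periods with a single \.\.* expression instead of the two fixed-length substitutions, which also covers runs longer than three.

```shell
# Hypothetical helper: print the dotted form of the name in $1; renames nothing.
dotify() {
  # 1) replace anything outside the whitelist with a period
  # 2) collapse any run of periods into one
  # 3) drop a single trailing period, if any
  printf '%s\n' "$1" | LC_ALL=C sed \
    -e 's/[^A-Za-z0-9.-]/./g' \
    -e 's/\.\.*/./g' \
    -e 's/\.$//'
}

dotify 'X-Men Days of Future Past (2014) [1080p]'
# prints: X-Men.Days.of.Future.Past.2014.1080p
```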

1

I know this is a bit old, but I recently discovered that translate-shell (a command-line front end to Google Translate) really helps with foreign-named files whose Unicode names cause trouble. It is helpful for batch renaming with translation in the shell.

$ echo скачать  | trans -b
download

https://github.com/soimort/translate-shell

[UPDATE] The Google Translate API tends to block you if you hit it too many times, but I also found a convenient local option called uconv that converts between alphabets. It helps phonetically (transliteration), but it does not translate:

echo скачать | uconv -x 'Any-Latin;Latin-ASCII'
skacat'
BoeroBoy
1

This is loosely based on @KrisWebDev's search string.

  • doesn't touch files/dirs; creates a batch list to review instead
  • goes via a two-stage temp file (which is faster on my machine)
  • handles more edge cases for Samba (trailing/leading spaces)
  • shows a basic progress indicator

Note: "already exists" problems may occur when doing the actual rename; these have to be solved manually.

# tested on: bash linux
# needs: bc
# this function doesn't change files on its own
sanitize_dir() {
rm -f /tmp/filenames_toreview_$$.txt
touch /tmp/filenames_toreview_$$.txt

echo "
The batch mv review file will be
/tmp/filenames_toreview_$$.txt
"

# find... and reverse list it, to prevent "file disappeared" (parent dirs are changed last)
find "$1" -depth | sort | tac >/tmp/filenames$$.txt

FOUNDNUM=$(wc -l < /tmp/filenames$$.txt)
echo "# found $FOUNDNUM filenames or dirnames to check."
echo "# found $FOUNDNUM filenames or dirnames to check."  >> /tmp/filenames_toreview_$$.txt

IFS=$'\n'
shopt -s extglob;

COUNT=1
PROC_OLD=N

for THISLINE in $(cat /tmp/filenames$$.txt);do

    # Some percentage info
    PROC=$(printf %1.f $(echo "($COUNT/$FOUNDNUM)*100" | bc -l))

    if [ "$PROC" != "$PROC_OLD" ];then
        echo "# $PROC%"
        echo "# $PROC%" >> /tmp/filenames_toreview_$$.txt
        PROC_OLD=$PROC
    fi

    filename=$(basename "$THISLINE")
    directory=$(dirname "$THISLINE")

    filename_clean=$(echo "$filename" | sed -E -e 's/[\\/:\*\?"\|\x01-\x1F\x7F]//g' -e 's/^(nul|prn|con|lpt[0-9]|com[0-9]|aux)$/_\1/' -e 's/^$/NONAME/')

    # multi spaces => single spaces
    filename_clean=$(echo "$filename_clean" | sed -E -e 's/\s+/ /g' )

    # leading and trailing spaces
    filename_clean=$(echo "$filename_clean" | sed -E -e 's/^\s+//; s/\s+$//;' )

    if (test "$filename" != "$filename_clean")
    then
        echo "mismatch: '$filename' != '$filename_clean'"

        if [ -d "$THISLINE" ] || [ -f "$THISLINE" ];then

            # %q shell-quotes each path so names containing quotes or spaces stay intact
            printf 'mv -v %q %q\n' "$THISLINE" "$directory/$filename_clean" >> /tmp/filenames_toreview_$$.txt

        else

            echo "File or dir disappeared. This shouldn't happen."

        fi
    fi
    COUNT=$((COUNT+1))

done
rm -f /tmp/filenames$$.txt

echo "

please review batch rename execution:
cat /tmp/filenames_toreview_$$.txt

"

}

sanitize_dir /goto/dir

Manu
1

If you want to handle embedded newlines, multibyte characters, spaces, leading dashes and backslashes, you are going to need something more robust; see this answer:
https://superuser.com/a/858671/365691

I put the script up on code.google.com if anyone is interested: r-n-f-bash-rename-script
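To give an idea of what "more robust" means here, this is a minimal sketch (not the linked script). It lets find hand each path to an inner sh as an argument, so newlines, spaces and leading dashes in names are never parsed as text. propose_renames is a hypothetical helper, the whitelist is my assumption, and it only prints proposals; nothing is renamed.

```shell
# Hypothetical helper: list "old -> new" for every entry below $1 whose
# last path component contains bytes outside the whitelist.
propose_renames() {
  # -mindepth 1 skips the starting directory itself; -depth lists a
  # directory's contents before the directory.
  find "$1" -mindepth 1 -depth -exec sh -c '
    for path do
      dir=${path%/*}
      name=${path##*/}
      # LC_ALL=C makes tr work on bytes; each unwanted byte becomes "_".
      clean=$(printf "%s" "$name" | LC_ALL=C tr -c "A-Za-z0-9._-" "_")
      [ "$name" = "$clean" ] || printf "%s -> %s\n" "$path" "$dir/$clean"
    done
  ' sh {} +
}
```

Called as propose_renames /path/to/files, it prints one line per entry that would change. A name that itself contains a newline makes that printed line ambiguous, so treat the output as a preview rather than something to feed back into a shell.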

Adam D.
-3

for file in *; do mv -- "$file" "$(echo "$file" | sed -e 's/[^A-Za-z0-9.-]//g')"; done &